PREFACE


This course was given at the Institute of Statistics of
the Consolidated University of North Carolina in the spring of
1954: again at Blacksburg, Virginia, at a Southern Regional
Graduate Summer Session held at the Virginia Polytechnic
Institute in the summer of 1954: and a third time as a
post-graduate course in the Michaelmas Term of 1954 and the Lent
Term of 1955 at the London School of Economics. There has
since been a fairly consistent demand for copies of the lecture
notes, the original supply of which is now exhausted. It has
therefore been decided to revise them and to issue them in the
present form for the general use of students of statistics.


Multivariate Analysis in statistics is apt to be a 
baffling subject, especially for those students who want to 
use it in solving practical problems but do not possess the time 
or the inclination to plumb the depths of the mathematical 
theory to which it leads. This course was prepared with
practical applications very much in the foreground. In it I have
tried to expound the essential concepts and techniques and have
limited the mathematical treatment as much as possible. In
the present stage of knowledge this is no loss. The analysis
of multivariate material requires to an unusual degree that
peculiar blend of insight and skill in probabilistic
interpretation which characterises the statistician, and for which
pure mathematics is no substitute. It will, nevertheless, be
evident that a considerable body of prerequisite knowledge is
needed to get the most out of the course: mathematics up to
matrix algebra, three-dimensional co-ordinate geometry and
beta functions; statistical theory up to the theory of
correlation and regression, the bivariate normal surface, and
tests of significance based on normal theory.


I hope that the course may be found of some interest and
use. Its issue in this form is experimental and I am indebted
to the publishers, Messrs Charles Griffin & Co., for the
willingness they have shown to explore many possible methods of
making available at a moderate price work which would be very
costly to set in print.

M.G.K.


London, 
August, 1957. 
A COURSE IN MULTIVARIATE ANALYSIS


1. INTRODUCTION


1.1 In a general sense "multivariate" analysis would include
practically the whole of statistical theory. Even in so-called
univariate problems, such as 'Student's' test of the
mean, we require the idea of independence (of mean and
variance in normal samples) and the complementary idea of
dependence; and indeed any sample is a special case of a multiple
variate in which the individuals, as a rule, are independent
and identically distributed.


1.2 By general consent the term "multivariate analysis" is
used in a much narrower sense. We find it easier to say what
it is not than what it is, and many writers prescribe their
domain of discussion by enumeration rather than by definition.
When we look at the whole field, however, we discern two main
features:

(a) We are concerned with a set of n individuals each
    of which bears the value of p different variates.
    The multivariate character, so to speak, lies in
    the multiplicity of the p variates, not in the
    size of the set n.

(b) The variates are dependent among themselves so that
    we cannot split off one or more from the others and
    consider it by itself. The variates must be
    considered together.


1.3 Another important characteristic of multivariate analysis
arises from a divergence of interest between the mathematician
and the statistician. The natural inclination of the
mathematician is towards generalization. Give him a result for
one variate and he inquires after the result for two; give him
that and he inquires after the result for p. The statistician,
on the other hand, is continually struggling to reduce the
dimensions of his problem. In multivariate analysis he
usually has an embarrassing profusion of variates and his object is
to make p as small as he can. (He still prefers n as large
as possible.) Discriminant analysis, for example, tries to
reduce the problem of distinguishing between multivariate
populations to the scale of a single variate.


1.4 We may thus define multivariate analysis as the branch of 
statistical analysis which is concerned with the relationships 
of sets of dependent variates. We shall subdivide the main 
block of the subject into two parts, according to whether we 
are concerned with dependence or interdependence. 


In dependence, one (or more) of the variates is selected
for us by the conditions of the problem and we require to
investigate the way in which it depends on the other variates,
the so-called but badly-named "independent" variates. The
regression of one variate on others or the variance analysis
of a set of yields in a factorial experiment are of this type.


In interdependence we are concerned with the relationship 
of a set of variates among themselves, no one being selected 
as special in the sense of the dependent variate. The analysis 
of functional relationships, correlation and component analysis 
fall into this group. 


1.5 Some parts of this field are covered at a comparatively 
early stage in statistical training: for example, partial 
association, partial correlation, the regression of one scalar 
variate on a set of others, and the analysis of variance. We 
shall not go over such ground again in this course. On the 
other hand, there are some portions which are rarely covered 
in quite advanced courses, such as component analysis and the 
analysis of functional relationships. The reasons for this
unbalanced scheme of instruction are partly historical and
partly due to didactic convenience (and perhaps we ought to
add, partly to the very severe theoretical problems which
arise). The present course will attempt to exhibit the
subject as a connected whole, though, naturally, more time will
be spent on those branches which are less familiar to the
student. Mathematical results will be proved or quoted, as
seems convenient, but an attempt will be made to explain their
significance so that the student who does not wish to delay
over mathematical details can take the results on trust and
return to their justification later. The emphasis
throughout will be on practical applications. Computational methods
will be taken for granted except (as in the case of component
analysis) where they are of an unfamiliar type.


1.6 To give concreteness to the exposition we list here some
types of practical problems with which multivariate analysis
is concerned. Some, but not all, of these problems will be
discussed in the sequel.


(a) Biometrics. A number of skulls are dug up on an
    ancient burial ground. They may all come from
    one race or they may be a mixture of two opposing
    races, friend and foe having been flung into one
    pit together. An unlimited number of measurements
    can be made on any one skull. What are the best
    measurements to take, what is the minimum number we
    require and how do we use them to test homogeneity
    or heterogeneity in the sample?

(b) Education. A number n of candidates take an
    examination in p parts and are given a mark for each
    part. What is the best system of arranging the
    candidates in order of merit, and does the notion
    of "order of merit" have any justifiable meaning?

(c) Agriculture. A number n of different areas each
    produce yields of p different crops. Can we make
    any comparisons of general productivity between
    areas, and, if so, how? Which crops are the best
    indicators of such a quality if it exists?

(d) Sociology. The replies given by members of a
    population to a questionnaire are expected to vary
    according to their social class. Information is
    collected about certain objective properties of a sample,
    e.g. rent, possession of a telephone, type of
    education. Can an index of social class be constructed
    from such material? And how should we test the
    significance of the difference between two samples?

(e) Medicine. A drug such as cortisone is injected into
    rats, which are subsequently killed and examined.
    Various organs are found to be affected as compared
    with a control group. How many of these effects
    are significant? Which are the organs most
    affected?

(f) Physics. A set of plastics are tested for various
    physical properties such as resilience, strength,
    elasticity, ability to withstand abnormal
    temperatures. Can we detect in the results any systematic
    effects such as would enable us to predict them from
    known molecular properties of the plastics? Can
    we then use the results to design better plastics
    for given purposes?

(g) Anthropology. For a set of Red Indian tribes there
    are recorded a number of items such as whether a
    tribe has a rain god, whether it uses totems, whether
    it has any agriculture. It is required to produce
    from this material criteria to decide whether a
    tribe belongs or not to certain ethnic groups, or
    to suggest what such groups might be.

(h) Economics. Each year there is produced for a given
    country data which are in some way bound up with its
    general business activity, such as national income,
    rate of interest, freight-car loadings, steel
    production, unemployment, bank clearings and marriage
    rate. Can we produce an index-number of "business
    activity" from this complex, and if we can does it
    have any objective meaning?
(i) Experimentation. By accident or design a
    multi-factor experiment is conducted with no proper
    balance or orthogonal properties. How do we
    assess the significance of the main effects?

(j) Industry. A firm of tailors making ready-to-wear
    suits of clothes wishes to produce enough to cover
    the requirements of its large clientele with the
    minimum of misfits and unsold garments. The
    operative measurements are leg length, hip girth,
    trunk length, arm length, chest girth, shoulder
    width and perhaps a few others. On what
    measurements should it work and how should it proceed to
    produce maximum satisfaction at minimum cost?


These examples, written down more or less at random,
illustrate the wide field of application for multivariate
analysis. This course is concerned with the methods which have
been devised to investigate them.


2. COMPONENT ANALYSIS 


2.1 Suppose we have p variates $x_1, \ldots, x_p$, each observed on
n individuals. We write $x_{ij}$ for the jth observation on the
ith variate so that the observations may be arrayed in a
matrix:

$$
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
\vdots & \vdots &        & \vdots \\
x_{p1} & x_{p2} & \cdots & x_{pn}
\end{pmatrix}
\tag{2.1}
$$

The object of component analysis is to economize in the
number of variates. To do this we shall seek for linear
transformations of type

$$
\zeta_i = \sum_{j=1}^{p} a_{ij} x_j, \qquad i = 1, 2, \ldots, p. \tag{2.2}
$$

(We confine ourselves throughout to linear transformations.
There is no reason why more complicated types should not be
considered but the theory would become difficult. In
practice, when the variation is obviously non-linear it is best
to try to transform to linear variation before embarking on a
component analysis.)


2.2 It may be that we can express the data in terms of fewer
than p of the $\zeta$'s. We have then effected a genuine reduction
in the dimensions of the problem, the whole complex of
variations being expressible in m < p variates. But this is
exceptional. Where it is not possible we shall try to carry out
an approximate reduction in this sense: we shall choose the
coefficients a so that the first of our new variates $\zeta_1$ has as
large a variance as possible; we shall then choose the second
$\zeta_2$ so as to be uncorrelated with the first and to have as
large a variance as possible; and so on. In this way we
transform to new uncorrelated variates (a useful thing in
itself) which account for as much of the variation as
possible in descending order. It may be that the first two or
three of these variates account for "nearly" the whole of
the variation, say 85 or 90 per cent, and the contribution of
the other p - 2 or p - 3 is small. We can then say that the
variation is represented approximately by the first two or
three variates and in favourable circumstances may be able to
neglect the remainder.


Basic theorem on effective dimension-number 


2.3 We consider first of all the case when the number of $\zeta$'s
is less than p; and we require the following results:

1. If the matrix $(x_{ij})$, $i = 1, \ldots, p$; $j = 1, \ldots, n$, is
   of rank m all the values $x_i$ are linearly dependent on m sets
   of them. In geometrical language, if we represent the n
   points in a Euclidean space of p dimensions by taking
   $x_1, \ldots, x_p$ as co-ordinates, they will lie in a flat space
   of m < p dimensions.

2. The rank of the product of a matrix by its transpose
   is equal to the rank of the matrix.


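Now let us take each $x_i$ measured about its mean, so that

$$
\sum_{j=1}^{n} x_{ij} = 0, \qquad i = 1, 2, \ldots, p, \tag{2.3}
$$

and also let us standardize so that the x's have unit variance.
Then

$$
\frac{1}{n} \sum_{j=1}^{n} x_{ij}^2 = 1, \qquad i = 1, 2, \ldots, p. \tag{2.4}
$$

Then the p x p matrix whose i, jth term is

$$
\operatorname{cov}(x_i, x_j) = \frac{1}{n} \sum_{k=1}^{n} x_{ik} x_{jk} = r_{ij}
$$

is 1/n times the product of $(x_{ij})$ and its transpose.
Except for a factor in n, that product is therefore the
correlation matrix

$$
\begin{pmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots &        & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}
\tag{2.5}
$$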

Then it follows that: if and only if the correlation matrix 
is of rank m < p the variation lies in a linear space of m 

dimensions (and consequently it is possible to find a linear 
transformation to m new variates which completely account for 


the variation). 


Example 2.1

Consider a four-variate case with matrix

     1     .8     .6     .6
    .8     1     .96     0
    .6    .96     1    -.28
    .6     0    -.28     1

An examination of the matrix (e.g. the calculation of its
determinant and of the determinants of its three-rowed minors)
shows that it is of rank 2 (p = 4, m = 2). It is therefore
possible to find a linear transformation to two new variates
$\zeta_1$ and $\zeta_2$ and only two variates are needed to express the data.

One set would be given by

$$
\begin{aligned}
x_1 &= \zeta_1 \\
x_2 &= 0.8\,\zeta_1 + 0.6\,\zeta_2 \\
x_3 &= 0.6\,\zeta_1 + 0.8\,\zeta_2 \\
x_4 &= 0.6\,\zeta_1 - 0.8\,\zeta_2
\end{aligned}
$$

but there are others.
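The rank claimed in Example 2.1 can be checked mechanically. The following is a minimal sketch, not part of the original notes: it uses exact rational arithmetic from the Python standard library, and the helper name `rank` and the loading table `A` are illustrative choices. It also confirms that the transformation quoted in the example reproduces every correlation when $\zeta_1$ and $\zeta_2$ are taken as uncorrelated with unit variance (an assumption the example leaves implicit).

```python
from fractions import Fraction

def rank(rows):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        # look for a row at or below r with a non-zero entry in column c
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# Correlation matrix of Example 2.1 (decimal strings keep the values exact).
R = [["1", "0.8", "0.6", "0.6"],
     ["0.8", "1", "0.96", "0"],
     ["0.6", "0.96", "1", "-0.28"],
     ["0.6", "0", "-0.28", "1"]]

# Coefficients of x_1..x_4 on (zeta_1, zeta_2) as quoted in the example.
A = [["1", "0"], ["0.8", "0.6"], ["0.6", "0.8"], ["0.6", "-0.8"]]
A = [[Fraction(v) for v in row] for row in A]

# Correlations implied by uncorrelated, unit-variance zeta's: A A'.
implied = [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]

matrix_rank = rank(R)
reproduces = implied == [[Fraction(v) for v in row] for row in R]
print(matrix_rank, reproduces)
```

Exact fractions are used so that "rank 2" is a statement about the matrix itself and not about a floating-point tolerance.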


Example 2.2

Consider the matrix (p x p)

$$
\begin{pmatrix}
1      & r      & \cdots & r \\
r      & 1      & \cdots & r \\
\vdots & \vdots &        & \vdots \\
r      & r      & \cdots & 1
\end{pmatrix}
$$

Adding the rows and taking out a factor $\{1 + (p-1)r\}$,
and then subtracting r times the unit row from all other rows,
we see that the determinant of this matrix is

$$
(1 - r)^{p-1} \{1 + (p-1)r\}.
$$

This cannot vanish unless r = 1 or r = -1/(p - 1). Except in
these special cases the rank of the matrix must then be p and
hence we cannot represent a set of equally correlated variates
in fewer than p dimensions.
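As a check on Example 2.2, the determinant of an equally correlated matrix can be compared with the closed form $(1 - r)^{p-1}\{1 + (p-1)r\}$. This is an illustrative sketch only; the function names and the trial values of p and r are assumptions chosen for the demonstration.

```python
from fractions import Fraction

def det(rows):
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n, d = len(m), Fraction(1)
    for c in range(n):
        pivot = next((i for i in range(c, n) if m[i][c] != 0), None)
        if pivot is None:
            return Fraction(0)          # a zero column: the matrix is singular
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            d = -d                      # a row swap changes the sign
        d *= m[c][c]
        for i in range(c + 1, n):
            f = m[i][c] / m[c][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[c])]
    return d

def equicorrelation(p, r):
    """p x p matrix with units on the diagonal and r elsewhere."""
    return [[Fraction(1) if i == j else Fraction(r) for j in range(p)]
            for i in range(p)]

def closed_form(p, r):
    return (1 - r) ** (p - 1) * (1 + (p - 1) * r)

p = 5
trial_values = (Fraction(1, 3), Fraction(-1, 10), Fraction(1), Fraction(-1, p - 1))
for r in trial_values:
    print(r, det(equicorrelation(p, r)), closed_form(p, r))
```

The last two trial values are the special cases r = 1 and r = -1/(p - 1), for which both the determinant and the closed form vanish.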


2.4 It is of some interest to record a result concerning the
rank of a symmetric matrix which relieves us of the necessity
for testing every minor: if one m-rowed principal minor is
not zero and if that minor vanishes (a) when any one row and
the corresponding column is annexed to it and (b) when any two
rows and the corresponding columns are annexed to it, the rank
is m.

There are p rows and a row can be annexed in p - m ways
and two rows in $\tfrac{1}{2}(p - m)(p - m - 1)$ ways. The number of
conditions on a symmetric matrix for it to be of rank m is then
$\tfrac{1}{2}(p - m)(p - m + 1)$. These are, in fact, independent
conditions. (Cf. Lederman, W., 1937, Psychometrika, 2, 85).


Principal components 


2.5 Consider the n points in the space of p dimensions when
the x's are expressed in standard measure (zero mean, unit
variance). Consider the line with current co-ordinates $\xi$

$$
\frac{\xi_1 - m_1}{l_1} = \frac{\xi_2 - m_2}{l_2} = \cdots = \frac{\xi_p - m_p}{l_p} \tag{2.6}
$$

where the l's are direction cosines and are therefore subject
to the condition

$$
\sum_{i=1}^{p} l_i^2 = 1. \tag{2.7}
$$

The sum of squares of the distances from the n points on to
this line is nS, say, given by

$$
nS = \sum_{j=1}^{n} \left[ \sum_{i=1}^{p} (x_{ij} - m_i)^2
     - \left\{ \sum_{i=1}^{p} l_i (x_{ij} - m_i) \right\}^2 \right]. \tag{2.8}
$$

If this is a stationary value the partial differentials with
respect to the m's vanish and hence

$$
\sum_{j=1}^{n} \left[ (x_{kj} - m_k) - l_k \sum_{i=1}^{p} l_i (x_{ij} - m_i) \right] = 0,
\qquad k = 1, 2, \ldots, p, \tag{2.9}
$$

and since $\sum_j x_{ij} = 0$ this leads to

$$
\frac{m_k}{l_k} = \sum_{i=1}^{p} l_i m_i = \text{constant}.
$$

Hence the origin lies on the line (2.6) and without loss of
generality we may take all the m's to be zero. Then, since
$\sum_j x_{ij}^2 = n$, we have

$$
nS = \sum_{j=1}^{n} \left[ \sum_{i=1}^{p} x_{ij}^2 - \Big( \sum_{i=1}^{p} l_i x_{ij} \Big)^2 \right]
   = np - \sum_{j=1}^{n} \Big( \sum_{i=1}^{p} l_i x_{ij} \Big)^2. \tag{2.10}
$$

We then find the stationary values of S for variations in l
subject to (2.7). If $\lambda$ is an undetermined multiplier this
leads to

$$
\frac{1}{n} \sum_{j=1}^{n} \Big( \sum_{i=1}^{p} l_i x_{ij} \Big) x_{kj} - \lambda l_k = 0,
\qquad k = 1, 2, \ldots, p, \tag{2.11}
$$

or the set of p equations

$$
\begin{aligned}
l_1 (1 - \lambda) + l_2 r_{12} + \cdots + l_p r_{1p} &= 0 \\
l_1 r_{21} + l_2 (1 - \lambda) + \cdots + l_p r_{2p} &= 0 \\
\cdots \qquad & \\
l_1 r_{p1} + l_2 r_{p2} + \cdots + l_p (1 - \lambda) &= 0
\end{aligned}
\tag{2.12}
$$

If we eliminate the l's we get the so-called characteristic
equation of the correlation matrix which may be written

$$
\begin{vmatrix}
1 - \lambda & r_{12}      & \cdots & r_{1p} \\
r_{21}      & 1 - \lambda & \cdots & r_{2p} \\
\vdots      & \vdots      &        & \vdots \\
r_{p1}      & r_{p2}      & \cdots & 1 - \lambda
\end{vmatrix} = 0. \tag{2.13}
$$

For known r this gives us, in general, p roots in $\lambda$. To each
root corresponds a set of l's for which S has a stationary
value. Moreover, by using (2.11) we find from (2.10)

$$
S = p - \lambda. \tag{2.14}
$$

$\lambda$ cannot be negative.

Furthermore, it follows from (2.14) that the root which
gives the minimum S is the one with the largest $\lambda$. Choosing
the largest root of (2.13) therefore gives us the line we
require. The sum of squares of distances of the points from
it is a minimum and the variate measured along it has the
maximum variance. The variate is given by

$$
\zeta_1 = \sum_{i=1}^{p} l_{1i} x_i \tag{2.15}
$$

where we write $l_{1i}$ with the subscript 1 to indicate that this
set of l's relate to $\lambda_1$. Multiplying (2.11) by $l_k$ and
summing over k we see that the variance of $\zeta_1$ is $\lambda_1$.
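For standardized variates, (2.10) says that a line through the origin with direction cosines l has $S = p - l'Rl$, where $l'Rl$ is the variance of the variate measured along the line. The sketch below evaluates this for a few directions on the correlation matrix of Example 2.1. It is an illustration under assumptions: the function names are hypothetical, and the "first principal axis" direction is taken from the worked numerical example given later in the notes, not derived here.

```python
import math

# Correlation matrix of Example 2.1.
R = [[1.0, 0.8, 0.6, 0.6],
     [0.8, 1.0, 0.96, 0.0],
     [0.6, 0.96, 1.0, -0.28],
     [0.6, 0.0, -0.28, 1.0]]
p = len(R)

def unit(v):
    """Scale v so that its squares sum to one, as required by (2.7)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def variance_along(l):
    """Variance of sum_i l_i x_i for standardized x's, i.e. l' R l."""
    return sum(l[i] * R[i][j] * l[j] for i in range(p) for j in range(p))

def S(l):
    """Mean squared perpendicular distance from the line with direction l."""
    return p - variance_along(l)

directions = {
    "first principal axis": unit([1, 1.1, 1, 0.2]),
    "x_1 axis": unit([1, 0, 0, 0]),
    "equal weights": unit([1, 1, 1, 1]),
}
for name, l in directions.items():
    print(f"{name}: variance = {variance_along(l):.4f}, S = {S(l):.4f}")
```

Among the three directions tried, the principal axis gives the largest variance and hence the smallest S, in line with (2.14).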


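2.6 It is not immediately obvious that if we now seek for
the direction (perpendicular to our line) for which the sum of
squares of perpendiculars is a minimum we shall arrive at the
line corresponding to $\lambda_2$, the second largest root of (2.13);
and so on. But such is the fact. Let us prove in the first
place that if $l_{\alpha i}$, $l_{\beta i}$ are the l's corresponding to any two
different roots $\lambda_\alpha$ and $\lambda_\beta$ then the l's are orthogonal, that
is to say

$$
\sum_{i=1}^{p} l_{\alpha i} l_{\beta i} = 0, \qquad \alpha \neq \beta. \tag{2.16}
$$

We have, from (2.12),

$$
\sum_{j} l_{\alpha j} r_{ij} = \lambda_\alpha l_{\alpha i}, \qquad
\sum_{j} l_{\beta j} r_{ij} = \lambda_\beta l_{\beta i}.
$$

Multiply the first by $l_{\beta i}$ and the second by $l_{\alpha i}$, sum over i
and subtract. We have

$$
\sum_i \sum_j \left( l_{\beta i} l_{\alpha j} r_{ij} - l_{\alpha i} l_{\beta j} r_{ij} \right)
= \sum_i \left( \lambda_\alpha l_{\alpha i} l_{\beta i} - \lambda_\beta l_{\alpha i} l_{\beta i} \right).
$$

Since $r_{ij} = r_{ji}$ the left-hand side vanishes and thus

$$
(\lambda_\alpha - \lambda_\beta) \sum_i l_{\alpha i} l_{\beta i} = 0
$$

from which (2.16) follows unless $\lambda_\alpha = \lambda_\beta$.

Thus if $\zeta_1, \ldots, \zeta_p$ correspond to the roots $\lambda_1, \ldots, \lambda_p$
the lines form an orthogonal set. In geometrical terms we
have rotated our axes of co-ordinates from the x-set to the
$\zeta$-set. The matrix $(l_{ij})$ is self-orthogonal. It follows
that since

$$
\zeta_i = \sum_{j=1}^{p} l_{ij} x_j
$$

we have also

$$
x_j = \sum_{i=1}^{p} l_{ij} \zeta_i. \tag{2.17}
$$

Also

$$
\operatorname{cov}(\zeta_i, \zeta_j)
= \operatorname{cov}\Big( \sum_k l_{ik} x_k, \; \sum_m l_{jm} x_m \Big)
= \sum_k \sum_m l_{ik} l_{jm} r_{km}
= \lambda_i \sum_m l_{im} l_{jm}
= 0 \quad (\text{unless } i = j),
$$

and hence the variates $\zeta$ are statistically uncorrelated.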
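The orthogonality result (2.16) and the vanishing covariance can be confirmed exactly on the correlation matrix of Example 2.1. In this sketch the two vectors `u` and `v` are the unnormalized l's quoted in the numerical work later in the notes, with roots 2.6 and 1.4; using them here, and the helper names `apply` and `dot`, are assumptions made for illustration.

```python
from fractions import Fraction as F

# Correlation matrix of Example 2.1 as exact fractions.
R = [[F(1), F(4, 5), F(3, 5), F(3, 5)],
     [F(4, 5), F(1), F(24, 25), F(0)],
     [F(3, 5), F(24, 25), F(1), F(-7, 25)],
     [F(3, 5), F(0), F(-7, 25), F(1)]]

u = [F(1), F(11, 10), F(1), F(1, 5)]     # proportional to the l's for lambda = 2.6
v = [F(1), F(-2, 5), F(-1), F(11, 5)]    # proportional to the l's for lambda = 1.4

def apply(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

satisfies_u = apply(R, u) == [F(13, 5) * x for x in u]   # equations (2.12) with lambda = 2.6
satisfies_v = apply(R, v) == [F(7, 5) * x for x in v]    # equations (2.12) with lambda = 1.4
orthogonal = dot(u, v) == 0                              # relation (2.16)
uncorrelated = dot(u, apply(R, v)) == 0                  # covariance of the two components
print(satisfies_u, satisfies_v, orthogonal, uncorrelated)
```

Because both vectors satisfy (2.12) for different roots, the last two checks are instances of the general argument above and do not depend on the scaling of `u` and `v`.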


2.7 We have thus transformed to new variates $\zeta$ which are
uncorrelated and have variances $\lambda_1, \lambda_2, \ldots, \lambda_p$ in decreasing
order. We note that $p = \sum \lambda$, as is also evident from (2.13);
for the sum of the roots of a p-ic in $\lambda$ is the sum of p units
in the main diagonal.

In particular, if the variates are normally distributed
we may regard the $\zeta$'s as splitting off independent components
of variance $\lambda_1, \lambda_2, \ldots, \lambda_p$ from the total p.

Note the following points which we state without proof:

(a) All roots of the characteristic equation (2.13) are
    real and non-negative. This is a property of
    non-negative definite matrices such as the correlation
    matrix.

(b) In degenerate cases certain $\lambda$'s may be equal; they
    may even be all equal. This is not of much
    practical importance. If it occurs the problem of
    determining a unique line or lines minimizing the sum
    of squares is indeterminate and an infinite set will
    satisfy the conditions.

(c) If certain $\lambda$'s vanish, say p - m of them, the
    correlation matrix is of rank m and we are back to the
    case of 2.3 in which the variation collapses into a
    space of m dimensions. The last p - m $\zeta$'s are then
    not required.

(d) We have standardized the variates by reducing them to
    unit variance before finding the $\zeta$'s. Had we not
    done so, but left the scales unchanged, we should have
    found different $\zeta$'s which are not transformable into our
    set by standardization after the new variates are
    determined. In geometrical language, lines of closest fit
    found by minimizing sums of squares of perpendiculars
    are not invariant under change of scale. This point is
    troublesome when we consider sampling problems.
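Point (d) can be seen in the smallest possible case of two variates. The sketch below is illustrative only: the correlation `r` and the scale factor `c` are hypothetical values, and `major_axis` is a helper written for this example using the standard closed form for the principal axis of a symmetric 2 x 2 matrix.

```python
import math

def major_axis(a, b, d):
    """Unit vector along the major principal axis of [[a, b], [b, d]]."""
    theta = 0.5 * math.atan2(2 * b, a - d)
    return (math.cos(theta), math.sin(theta))

r = 0.5     # hypothetical correlation between the two variates
c = 10.0    # hypothetical change of scale applied to the second variate

# Standardized variates: the dispersion matrix is the correlation matrix.
l_std = major_axis(1.0, r, 1.0)

# Second variate measured in units making it c times larger:
# the dispersion matrix becomes [[1, r c], [r c, c^2]].
l_raw = major_axis(1.0, r * c, c * c)

# Re-express the raw-scale component in terms of the standardized variates:
# l1 * x1 + l2 * (c * z2) has coefficients (l1, c * l2) on (z1, z2).
coef = (l_raw[0], c * l_raw[1])
norm = math.hypot(*coef)
l_back = (coef[0] / norm, coef[1] / norm)

print("standardize first, then find the axis:", l_std)
print("find the axis first, then standardize:", l_back)
```

With standardized variates the first component weights the two variates equally; found on the rescaled variates and then converted back, it is dominated by the second variate. The two routes do not agree, which is the lack of invariance described in (d).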


Example 2.3 


(Kendall, 1939, J. Roy. Statistical Soc., 102, 21.)

The yields of ten crops were recorded for 48 counties in
England (n = 48, p = 10). The crops were wheat, barley,
oats, beans, peas, potatoes, turnips, mangolds, hay (temporary
grass) and hay (permanent grass). The correlations between the
various crops were nearly all positive, suggesting that there
might be some quality "productivity" associated with an area
irrespective of the crops actually grown. To allow for
climatic variations four years were chosen; the results for them
agreed quite closely.

The correlation matrix was computed and the largest root
ascertained. For 1925, for instance, $\lambda = 4.760$ and thus the
corresponding $\zeta_1$ accounts for 47.6 per cent of the total
variation. The corresponding $\zeta$ was given by

$$
\begin{aligned}
\zeta_1 = {} & 0.39x_1 + 0.37x_2 + 0.39x_3 + 0.27x_4 + 0.22x_5 \\
          & + 0.30x_6 + 0.32x_7 + 0.26x_8 + 0.24x_9 + 0.34x_{10}.
\end{aligned}
\tag{2.18}
$$

This variate was provisionally identified with
productivity and the counties were arranged in order according to
the magnitude of $\zeta$ as given by (2.18). They were then
grouped into very good, good, moderate, poor, bad according to
the values of $\zeta$ which they bore. The results agreed with
general knowledge about the geographical distribution of
productivity except in one or two instances.

In this case a value of $\lambda = 4.76$ is not very high and we
suspect that another variate, at least, is required to
"explain" the variation. A more detailed analysis is given by
Banks, C.H. (1954), J. Roy. Statist. Soc. B. We note that
all coefficients in (2.18) are positive.


From one point of view (2.18) may be regarded as
determining an index $\zeta$ of productivity. In the ordinary way one
might expect the crop yields to be weighted in some way
according to the production or acreage of the various crops.
In the 1939 paper Kendall tried alternative methods weighting
both by value and by starch-equivalent but the results were
much the same as by the use of (2.18).

The problem here was to reduce the 10-dimensional
variation to a one-dimensional variation so that the resulting
variate might be used as a measure of productivity. The
attempt was partly successful, and as successful as any attempt
can be which endeavours to express productivity as due to the
variation of one component only.


Numerical solution of the characteristic equation 


2.8 The best method of solving the characteristic equation is
by iteration. It has the advantage that the l's
corresponding to any root are found simultaneously. We will illustrate
on the matrix of example 2.1, which we already know to contain
two components only:

                                      A         B
      1      .8      .6      .6      1.0     1.0000
     .8      1      .96      0       0.92     .9974
     .6     .96      1     -.28      0.76     .8632
     .6      0     -.28      1       0.44     .3368
    ---------------------------
     3.0    2.76    2.28    1.32                      (2.19)

We add the columns as shown, divide by the largest and write
vertically as at column A. We multiply the rows of the
matrix by the figure in the corresponding row in A to obtain
a new matrix

    1.0000    .8000    .6000    .6000
     .7360    .9200    .8832    .0000
     .4560    .7296    .7600   -.2128
     .2640    .0000   -.1232    .4400
    ---------------------------------
    2.4560   2.4496   2.1200   0.8272                 (2.20)

Again we sum columns, divide by the sum in the same column as
before and write the results at B in (2.19). We repeat the
process until we get two columns of types A and B identical.
At that stage we have the l's corresponding to the largest $\lambda$
given as proportional to the numbers in the final column.
Thus, after 12 iterations on (2.19) we find 1.0000, 1.1000,
1.0000, .2000. (This, as it happens, is exact.) These are
proportional to the l's. The sum of their squares is 3.25
and hence the l's are

$$
\frac{1}{\sqrt{3.25}} \, (1, \; 1.1, \; 1, \; 0.2). \tag{2.21}
$$

The corresponding $\lambda$ is given by multiplying these figures by
the corresponding terms in the first row of the matrix and
dividing by $l_1$, giving

$$
\lambda = \frac{\dfrac{1}{\sqrt{3.25}} \left\{ (1 \times 1) + (1.1 \times .8) + (1 \times .6) + (.2 \times .6) \right\}}
               {\dfrac{1}{\sqrt{3.25}}} = 2.6.
$$
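The tabular procedure of 2.8 can be written out as a short program. This is a sketch under stated assumptions, not the author's own computing scheme: the function name `next_column`, the stopping tolerance and the iteration cap are illustrative, and each new column is divided by the sum of the first column, which is how the worked example proceeds.

```python
# Correlation matrix of Example 2.1, as laid out in (2.19).
R = [[1.0, 0.8, 0.6, 0.6],
     [0.8, 1.0, 0.96, 0.0],
     [0.6, 0.96, 1.0, -0.28],
     [0.6, 0.0, -0.28, 1.0]]

def next_column(M, col):
    """Weight the rows of M by col, sum the columns, divide by the first sum."""
    n = len(M)
    sums = [sum(col[i] * M[i][j] for i in range(n)) for j in range(n)]
    return [s / sums[0] for s in sums], sums[0]

col = [1.0] * len(R)      # unit weights: the first step gives the plain column sums
history = []
for step in range(200):   # assumed cap; convergence is reached well before it
    new, first_sum = next_column(R, col)
    history.append(new)
    converged = max(abs(a - b) for a, b in zip(new, col)) < 1e-13
    col = new
    if converged:
        break

column_A, column_B = history[0], history[1]
lam = first_sum           # with col[0] = 1 the first column sum estimates lambda

print("A:", [round(x, 4) for x in column_A])
print("B:", [round(x, 4) for x in column_B])
print("final column:", [round(x, 4) for x in col], "lambda:", round(lam, 4))
```

The first two columns produced are columns A and B of (2.19), and the final column is proportional to the l's of (2.21).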


2.9 [...] and the sums are, as a second approximation,
proportional to the right-hand side, namely to the l's. Using
these new values we obtain a third [...]
process converges. We now prove (a) that it always will
converge if the $\lambda$'s are different and (b) that unless by accident
we happen to hit on a case where the first trial is exactly a
root corresponding to a smaller $\lambda$ (a possibility so remote as
to be negligible) the process will converge to the largest $\lambda$.

In fact, consider a line whose direction cosines in the
original space are $a_1, \ldots, a_p$. If we transform to the
$\zeta$-axes the direction cosines become, say, $b_1, \ldots, b_p$,
related to the a's by

$$
a_i = \sum_{k=1}^{p} l_{ki} b_k.
$$

If the a's relate to an approximation to the direction cosines
of a principal axis and a' to the next approximation determined
by

$$
\sum_{j=1}^{p} r_{ij} a_j = a'_i
$$

we shall then have the corresponding relation

$$
a'_i = \sum_{k=1}^{p} l_{ki} b'_k.
$$

But

$$
\sum_{j} r_{ij} l_{kj} = \lambda_k l_{ki}
$$

and hence

$$
\sum_{j} r_{ij} \sum_{k} l_{kj} b_k = \sum_{k} \lambda_k l_{ki} b_k = \sum_{k} l_{ki} b'_k
$$

or

$$
\sum_{k} (\lambda_k b_k - b'_k) \, l_{ki} = 0.
$$

This is equivalent to

$$
b'_k = \lambda_k b_k.
$$

Now if we start with trial a's and construct successive
approximations in the manner described, this is equivalent, in the
co-ordinate system of $\zeta$, to constructing successive values
of $b_k$ by the rule $b'_k = \lambda_k b_k$. If, moreover,
$\lambda_1 > \lambda_2 > \cdots > \lambda_p$, the ratio $b'_k / b'_1$ is less
than $b_k / b_1$ unless $b_k = 0$.
Under iteration these ratios therefore approach zero. Hence

(a) In general the direction cosines with respect to
    the $\zeta$-system tend to zero except for $b_1$ and hence
    the process converges to the largest root and the
    corresponding principal axis;

(b) This fails if $b_1 = 0$, in which case the successive
    approximations all give lines perpendicular to the
    major principal axis and the process converges to
    the largest root in the hyperplane perpendicular to
    that axis;

(c) If several of the b's, e.g. $b_1, \ldots, b_k$, correspond
    to equal $\lambda$'s, the remaining direction cosines tend
    to zero with respect to them; but for the equal
    $\lambda$'s the process does not converge further in the
    sense of differentiating between them.


2.10 Having found the largest root we proceed to find the
next largest. To do so we "extract" from the original matrix
the first component in the following manner:

Form the matrix $\lambda \, l_i l_j$ to get

     .8      .88      .8      .16
     .88     .968     .88     .176
     .8      .88      .8      .16
     .16     .176     .16     .032                    (2.22)

Subtract this from the original matrix to get

     .2     -.08     -.2      .44
    -.08     .032     .08    -.176
    -.2      .08      .2     -.44
     .44    -.176    -.44     .968                    (2.23)

Then proceed to apply the same procedure as before to this
residual matrix. (We justify this procedure below.)

In this instance we find for the sums of columns in (2.23):
0.36, -0.144, -0.36, 0.792 and the iteration converges at once
to l's proportional to 1, -0.4, -1, 2.2 with $\lambda = 1.4$. If we
go on to extract this from (2.23) we find a vanishing matrix
and hence we have exhausted the variation; this is confirmed
by the fact that our two $\lambda$'s, 2.6 and 1.4, add up to 4, the
value of p. The second and final component has then l's equal
to

$$
\frac{1}{\sqrt{7}} \, (1, \; -0.4, \; -1, \; 2.2) \tag{2.24}
$$

with $\lambda = 1.4$.
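The extraction step of 2.10 can be sketched as follows. The code is an illustration under assumptions: `power_iteration` and `extract` are hypothetical helper names, vectors are scaled to unit length instead of being divided by the first column sum, and the starting vector of equal weights is assumed not to be perpendicular to the axis being sought (the exceptional case (b) of 2.9).

```python
import math

def power_iteration(M, tol=1e-13, max_iter=500):
    """Largest root of a symmetric non-negative definite M and its unit vector."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(max_iter):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        change = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if change < tol:
            break
    lam = sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))
    return lam, v

def extract(M, lam, l):
    """Residual matrix after removing the component lam * l_i * l_j."""
    n = len(M)
    return [[M[i][j] - lam * l[i] * l[j] for j in range(n)] for i in range(n)]

# Correlation matrix of Example 2.1.
R = [[1.0, 0.8, 0.6, 0.6],
     [0.8, 1.0, 0.96, 0.0],
     [0.6, 0.96, 1.0, -0.28],
     [0.6, 0.0, -0.28, 1.0]]

lam1, l1 = power_iteration(R)
R1 = extract(R, lam1, l1)            # residual matrix, compare (2.23)
lam2, l2 = power_iteration(R1)
R2 = extract(R1, lam2, l2)           # should vanish: the variation is exhausted
largest_residual = max(abs(x) for row in R2 for x in row)

print(round(lam1, 4), round(lam2, 4), largest_residual < 1e-9)
print([round(x / l2[0], 4) for x in l2])
```

The residual matrix is deliberately not re-standardized before the second pass, in keeping with the procedure described in the text.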


2.11 The procedure of "extracting" components from a matrix
in the manner described may be justified as follows.

The transformation to new variables $\zeta$ results in a set of
variables which are uncorrelated. We have

$$
\operatorname{cov}(x_i, x_j)
= \operatorname{cov}\Big( \sum_k l_{ki} \zeta_k, \; \sum_m l_{mj} \zeta_m \Big)
= \sum_k l_{ki} l_{kj} \operatorname{var}(\zeta_k)
= \sum_k l_{ki} l_{kj} \lambda_k. \tag{2.25}
$$

If we now fix $\zeta_1$ the covariance of the resulting conditioned
$x_i$ and $x_j$ is the sum on the right in (2.25) less the first term
$l_{1i} l_{1j} \lambda_1$. Thus the covariance after the first component is
removed is given by the procedure of 2.10, for the original
matrix (by reason of standardization) is a covariance matrix.

To find the next component we operate on this residual
matrix but we do not re-standardize it so that the diagonal
terms are unity. Geometrically viewed, the determination of
the second component is equivalent to finding the line, in the
space perpendicular to the first component, for which the sum
of squares of distances is a minimum. It is not, perhaps,
obvious that this will give us the second largest root of the
characteristic equation (2.13); but, as we noted in 2.6, the
$\lambda_2$ direction is perpendicular to that corresponding to $\lambda_1$ and
if the sum of squares is stationary the corresponding direction
must coincide with the $\lambda_2$ direction. The original covariances
are reduced by terms like $l_{ki} l_{kj} \lambda_k$ at the extraction of the
kth factor, as we see from (2.25).


Acceleration by powering 
2.12 The convergence procedure of 2.8 may be greatly
accelerated by the following device: reverting again to the original
matrix (2.19), square it to get

     2.3600    2.1760    1.8000    1.0320
     2.1760    2.5616    2.4000    0.2112
     1.8000    2.4000    2.3600   -0.2000
     1.0320    0.2112   -0.2000    1.4384
    -------------------------------------
     7.3680    7.3488    6.3600    2.4816             (2.26)

If we now form the column sums and proceed as before our first
trial column is 1.0, .9974, .8632, .3368, which is the second
column of (2.19). We should find that the second trial column
of (2.26) would be the fourth of (2.19); and, generally, that
the convergence is twice as fast.

Nor need we stop here. If we square (2.26) and operate
on the resultant

    14.6096   15.2474   13.5120    4.0195
    15.2474   17.1014   15.6864    2.6104
    13.5120   15.6864   14.6096    1.6048
     4.0195    2.6104    1.6048    3.2186
    -------------------------------------
    47.3885   50.6456   45.4128   11.4533             (2.27)

the first trial column gives

     1.0    1.0687    0.9583    0.2417

which is the same as we should get by four iterations on the
original matrix. Generally, if we raise the matrix to the
tth power the convergence is t times as fast. (Convenient
numbers for practice are t = 4 or t = 8. It does not pay to
power too far because the elements of the matrix then contain
too many digits.)
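The effect of powering can be reproduced directly. In this minimal sketch the helper names `matmul` and `trial_column` are assumptions; the code forms the squared and fourth-power matrices and compares their first trial columns with the second and fourth trial columns obtained from the original matrix.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trial_column(M, col):
    """Weight the rows of M by col, sum the columns, divide by the first sum."""
    n = len(M)
    sums = [sum(col[i] * M[i][j] for i in range(n)) for j in range(n)]
    return [s / sums[0] for s in sums]

# Correlation matrix of Example 2.1.
R = [[1.0, 0.8, 0.6, 0.6],
     [0.8, 1.0, 0.96, 0.0],
     [0.6, 0.96, 1.0, -0.28],
     [0.6, 0.0, -0.28, 1.0]]

R2 = matmul(R, R)        # the squared matrix, compare (2.26)
R4 = matmul(R2, R2)      # the fourth power, compare (2.27)

ones = [1.0] * len(R)
plain = [ones]
for _ in range(4):
    plain.append(trial_column(R, plain[-1]))   # plain[k] is the k-th trial column on R

first_on_R2 = trial_column(R2, ones)
first_on_R4 = trial_column(R4, ones)

print([round(x, 4) for x in first_on_R2], [round(x, 4) for x in plain[2]])
print([round(x, 4) for x in first_on_R4], [round(x, 4) for x in plain[4]])
```

Each pair of printed columns agrees, which is the sense in which iteration on the tth power is t times as fast.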


2.13 The reason for this is easily seen. If $l_i$ is a trial
set of l's so that the next set is

$$
l'_i = A \sum_{j} l_j r_{ji},
$$

where A is a constant, the next set is

$$
l''_i = B \sum_{j} l'_j r_{ji} = BA \sum_{k} l_k \sum_{j} r_{kj} r_{ji},
$$

and the second summation on the right is the k, ith element
of the matrix $(r_{ij})$ squared.


2.14 Other methods of extracting roots are known. Burt (1937,
Brit. J. Ed. Psych. 7, 172) points out that in the powered
matrix the diagonal elements tend to become proportional to
the squares of the l's. Aitken (1937, Proc. Roy. Soc. Ed.
A, 37, 269) proceeds by pivotal condensation. Anyone who has
much of this kind of arithmetic to do is probably well advised
to consult an expert. Most electronic machines now carry a
programme which will extract the characteristic roots $\lambda$ and
the corresponding vectors $l_{ij}$.


2.15 Note one possible source of confusion. As we have defined
the components $\xi$ they do not have unit variances. In
our example the new variates, from (2.21) and (2.23), are

$$\xi_1 = \frac{1}{\sqrt{3.25}} \{x_1 + 1.1 x_2 + x_3 + 0.2 x_4\} \tag{2.28}$$

$$\xi_2 = \frac{1}{\sqrt{7}} \{x_1 - 0.4 x_2 - x_3 + 2.2 x_4\} \tag{2.29}$$

with variances 2.6 and 1.4 respectively. (We note immediately
that they are orthogonal, as they must be.) In factor analysis
it is more usual to standardize the $\xi$'s so that they themselves
have unit variance. Thus we should have two "factors", say
$f_1$ and $f_2$, such that $\xi_1 = \sqrt{2.6}\, f_1$ and $\xi_2 = \sqrt{1.4}\, f_2$. On
substitution in (2.28) and (2.29) this gives us

$$f_1 = \frac{1}{\sqrt{2.6 \times 3.25}} \{x_1 + 1.1 x_2 + x_3 + 0.2 x_4\}
      = .3440 x_1 + .3784 x_2 + .3440 x_3 + .0688 x_4 \tag{2.30}$$

and

$$f_2 = \frac{1}{\sqrt{1.4 \times 7}} \{x_1 - 0.4 x_2 - x_3 + 2.2 x_4\}
      = .3194 x_1 - .1278 x_2 - .3194 x_3 + .7028 x_4 \tag{2.31}$$
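
The rescaling in 2.15 amounts to dividing each unit-length vector of coefficients by the square root of its characteristic root. A minimal sketch, assuming the directions (1, 1.1, 1, 0.2) and (1, -0.4, -1, 2.2) and the roots 2.6 and 1.4 of the example; the function name is hypothetical.

```python
import math


def unit_variance_coefficients(direction, root):
    """Scale `direction` to unit length, then divide by sqrt(root),
    so that the resulting linear function has unit variance."""
    length = math.sqrt(sum(c * c for c in direction))
    return [c / (length * math.sqrt(root)) for c in direction]


f1 = unit_variance_coefficients([1.0, 1.1, 1.0, 0.2], 2.6)
f2 = unit_variance_coefficients([1.0, -0.4, -1.0, 2.2], 1.4)

print([round(c, 4) for c in f1])
print([round(c, 4) for c in f2])
```

To four decimals the two lists should reproduce the coefficients of (2.30) and (2.31).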
Example 2.4 


(Stone, 1947, Supp. J. Roy. Statist. Soc. 9, 1) 


Stone took Kuznets' and Barger's data for the U.S.A.
comprising, for each of the years 1922-1938, 17 series regarded
as constituent elements of total national income or outlay,
e.g. employers' compensation, consumers' perishable goods plus
producers' durable goods, net public outlay, net increase in
inventories, dividends, interest, foreign balance and so on.
He did a principal component analysis on the observations
taken about their mean and extracted three principal components
with $\lambda$ = 80.76, 10.59, 6.09 per cent, accounting for 97.45 per
cent of the variance. Evidently these three components account
for nearly all the variation and we have thus reduced the
effective dimensions of variation from 17 to 3. This illustrates
the economy in effective dimension number which it is the object
of component analysis to achieve.


The remarkable feature of Stone's work, however, is that
he was able to interpret his components. In many cases our
principal components do not have an identifiable separate
existence and are to be regarded as convenient mathematical
artefacts. In others (the "general intelligence" factor is
a notorious case) it is arguable whether the components can
be given any reality. Stone, however, had reason to suppose
on economic grounds that the variation was mostly accounted
for by three components: (a) total income I or some similar
quantity, (b) rate of change of I, say ΔI, and (c) a trend
term expressing expansion or contraction of the economy, which
we may take as a linear term in the time t. Moreover, these
quantities were separately measurable. Stone correlated his
three principal components, say F1, F2, F3 (in his notation)
with I, ΔI and t to obtain

            F1      F2      F3      I       ΔI      t
    F1      1
    F2      0       1
    F3      0       0       1
    I      .995   -.041    .057     1
    ΔI    -.056    .948   -.124   -.102     1
    t     -.369   -.282   -.836   -.414   -.112     1

The three underlined figures (.995, .948 and -.836) stand out
and it seems very reasonable to identify F1 with I, F2 with ΔI
and F3 with t.


The results of the inquiry are to be interpreted cautiously
because there were only 17 observations, but the study is a
remarkable one. Stone notes that in demand analysis, when
including a price we ought to include a large number of other
prices. This is usually impossible, but we may be able to
introduce a principal component or two representing the price
complex through an index number.


The centroid method 


2.16 Psychological workers have developed numerous methods of
component analysis which avoid the arithmetic required by the
solution of the characteristic equation. My personal opinion
is that they are objectionable and should not be used when they
can be avoided. So much published work has been based on them,
however, that some account is necessary. Perhaps they can be
justified to some extent as giving approximations to the
principal component method, but any discussion of their sampling
properties seems almost beyond the range of reasonable
possibility.


2.17 The matrix of observations (2.1) can be regarded as
defining not only n points in p dimensions but p points in n
dimensions. Actually these p points, one corresponding to each
variate, will, together with the origin, define a p-dimensional
space embedded in the n-space. We may imagine them as
radiating from the origin like the spokes of an umbrella.
Their lengths are proportional to the variances (in our case
all unity) and the cosines of the angles between them are the
correlation coefficients.


2.18 If the correlations $r_{ij}$ are large the angles between the
vectors are small and they bunch together. If there is a
common component we may expect it to go somewhere through the
"middle" of this bunch. The centroid method formalises this
idea by supposing that the first component passes through the
centre of gravity of the points.


Let us take a set of orthogonal co-ordinate axes in this
second kind of p-dimensional space. If our original x's have
been standardized so as to have unit variance the p points in
this space will be at the ends of p unit vectors radiating
from the origin. Let the co-ordinates of the ith point in
the jth dimension be $y_{ij}$. Then

$$\sum_{j=1}^{p} y_{ij}^2 = 1 \quad \text{and} \quad r_{ij} = \sum_{k=1}^{p} y_{ik} y_{jk} \tag{2.32}$$

The centroid of the p points is then given by $\bar{y}_j$ say, where

$$\bar{y}_j = \frac{1}{p} \sum_{i=1}^{p} y_{ij} \tag{2.33}$$

Suppose we take the axis of the first co-ordinate to go through
the centroid, as we may without loss of generality. Then
$\bar{y}_k = 0$ unless k = 1. Hence

$$\sum_{j} r_{ij} = \sum_{k} y_{ik} \sum_{j} y_{jk} = p \sum_{k} y_{ik} \bar{y}_k = p\, y_{i1} \bar{y}_1 \tag{2.34}$$

Let us sum the columns in the correlation matrix, giving us
sums $A_1, \ldots, A_p$ such that

$$A_i = p\, y_{i1} \bar{y}_1$$

and

$$T = \sum_i A_i = p\, \bar{y}_1 \sum_i y_{i1} = (p\, \bar{y}_1)^2 \tag{2.35}$$

This gives us

$$\bar{y}_1 = \frac{\sqrt{T}}{p} \quad \text{and} \quad y_{i1} = \frac{A_i}{\sqrt{T}} \tag{2.36}$$

Thus the first component, going through the centroid, has
coefficients proportional to $A_i/\sqrt{T}$, which are easily found from
the correlation matrix.


Example 2.5 


Consider again the matrix of Example 2.1:

      1      .8     .6     .6
      .8     1      .96    0
      .6     .96    1     -.28
      .6     0     -.28    1

We find $T = \sum A_i = 9.36$; $1/\sqrt{T} = .326,860$,
and hence the first component is proportional to

$$\frac{1}{\sqrt{T}} \sum A_i x_i = .9806 x_1 + .9021 x_2 + .7452 x_3 + .4315 x_4 . \tag{2.37}$$

To make this comparable with the first principal component
$\xi_1$ we need to standardize it so that the sum of squares of
coefficients is unity. The sum of squares in (2.37) is
2.516,876. On division by the square root of this we get
for the first centroid component

$$.6181 x_1 + .5686 x_2 + .4697 x_3 + .2720 x_4 \tag{2.38}$$

as against the first principal component (2.21)

$$.5548 x_1 + .6163 x_2 + .5548 x_3 + .1110 x_4 . \tag{2.39}$$

The variance of the linear function (2.37) is easily seen to be
$\sum \sum A_i A_j r_{ij} / T$, which in this case turns out to be 6.427,692.
Division of this by 2.516,876 gives us, for the variance of
(2.38), 2.554. This is to be compared with the optimum value
2.6 for the first principal component.

It should be noted that the first centroid component is
equivalent to the first approximation of (2.19).
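
The computation of Example 2.5 is short enough to reproduce directly. The sketch below is a minimal illustration in Python, assuming the correlation matrix written out at the head of the example; the variable names are illustrative only.

```python
import math

# Correlation matrix of Example 2.1.
R = [
    [1.00, 0.80, 0.60, 0.60],
    [0.80, 1.00, 0.96, 0.00],
    [0.60, 0.96, 1.00, -0.28],
    [0.60, 0.00, -0.28, 1.00],
]
p = len(R)

A = [sum(R[i][j] for i in range(p)) for j in range(p)]   # column sums A_i
T = sum(A)                                               # grand total
loadings = [a / math.sqrt(T) for a in A]                 # coefficients of (2.37)

# Rescale to unit sum of squares, as in (2.38).
ss = sum(c * c for c in loadings)
unit = [c / math.sqrt(ss) for c in loadings]

# Variance of the unit-length combination, sum_ij u_i u_j r_ij.
var_unit = sum(unit[i] * R[i][j] * unit[j] for i in range(p) for j in range(p))

print(round(T, 2), [round(c, 4) for c in loadings])
print([round(c, 4) for c in unit], round(var_unit, 3))
```

Carried at full precision the sum of squares differs from the figure quoted in the text in the fourth decimal place, which does not affect the coefficients of (2.38) or the variance 2.554 to the accuracy shown.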


2.19 The determination of the first component is comparatively
simple, but the real trouble lies ahead. If, in the previous
example, we form the matrix of products of the coefficients
in (2.37)

     .9616    .8846    .7307    .4231
     .8846    .8138    .6722    .3893
     .7307    .6722    .5553    .3216
     .4231    .3893    .3216    .1862        (2.40)

and subtract from the original matrix, we get a residual

     .0384   -.0846   -.1307    .1769
    -.0846    .1862    .2878   -.3893
    -.1307    .2878    .4447   -.6016
     .1769   -.3893   -.6016    .8138
    ---------------------------------
     .0000    .0001    .0002   -.0002        (2.41)

We will prove presently that it is the matrix of (2.37)
which is to be extracted, not, for example, that of (2.38).
Taking this temporarily for granted, let us note that we
cannot now proceed to extract a second component by the same
method because all the column sums are zero (within errors of
rounding). And this must always be so. Geometrically, in
the p-space we have found a vector passing through the centroid,
and in (2.41) we have, in effect, projected the complex of
vectors on to a space perpendicular to this vector. The
centroid of the projected points obviously lies at the origin and
we cannot draw a new vector through this origin and a new
centroid.


2.20 We therefore proceed by reflecting one or more of the 
vectors in the origin so as to "break the centroid away" from 
the origin. Unfortunately, this involves an element of sub- 
jective judgement. Several rules have been proposed, e.g. to 
reflect the vectors with the most negative signs in the resi- 
dual matrix. But there is no absolute rule. 


Example 2.6 


In Example 2.5 the procedure is fairly clear. The signs
in (2.41) run

     +   -   -   +
     -   +   +   -
     -   +   +   -
     +   -   -   +

We then "reflect" the vectors corresponding to $x_2$ and $x_3$.
This will change the signs of the terms in the second and third
columns and again those in the second and third rows, leaving
the terms in both unchanged. The terms in the transformed
matrix are then all positive and the sum of terms in the
columns of (2.41) is

     .4306    .9479    1.4648    1.9816

We find T = 4.8249, $1/\sqrt{T}$ = .455,256. The second component
has then coefficients proportional to

     .1960    .4315    .6669    .9021.

But we must now "put the signs back" by changing those of the
second and third terms. The second component is thus
proportional to

$$.1960 x_1 - .4315 x_2 - .6669 x_3 + .9021 x_4 \tag{2.42}$$

It is readily verified that this is independent of the factor
(2.37).
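
The reflection step can be sketched as follows. This is an illustration only, assuming the matrix of Example 2.1 and the choice made in the text of reflecting the second and third vectors; `centroid_loadings` is a hypothetical helper. Because the residual is here carried at full precision rather than rounded to four places, the results may differ from the printed ones by a unit in the fourth decimal.

```python
import math

R = [
    [1.00, 0.80, 0.60, 0.60],
    [0.80, 1.00, 0.96, 0.00],
    [0.60, 0.96, 1.00, -0.28],
    [0.60, 0.00, -0.28, 1.00],
]
p = len(R)


def centroid_loadings(m):
    """Column sums of m divided by the square root of the grand total."""
    n = len(m)
    cols = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [c / math.sqrt(sum(cols)) for c in cols]


first = centroid_loadings(R)                                  # as in (2.37)
residual = [[R[i][j] - first[i] * first[j] for j in range(p)]
            for i in range(p)]                                # as in (2.41)
residual_column_sums = [sum(residual[i][j] for i in range(p)) for j in range(p)]

signs = [1, -1, -1, 1]          # reflect the vectors for x2 and x3
reflected = [[signs[i] * signs[j] * residual[i][j] for j in range(p)]
             for i in range(p)]

# Extract from the reflected matrix, then put the signs back.
second = [s * c for s, c in zip(signs, centroid_loadings(reflected))]
cross_product = sum(a * b for a, b in zip(first, second))

print([round(c, 4) for c in second])
print(round(cross_product, 4))
```

The residual column sums vanish, which is why the reflection is needed, and the cross product of the two sets of coefficients is visibly different from zero.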


2.21 The following points may be briefly noted:

(a) The sum of cross products in (2.37) and (2.42), namely

     (.9806 x .1960) + (.9021 x -.4315) + (.7452 x -.6669)
          + (.4315 x .9021)

does not vanish. The centroid components are not orthogonal
in the p-space which we considered in arriving at the
principal components. They are orthogonal in our second p-space.
Only the principal components are orthogonal in both.

(b) If we have to proceed to determine third, fourth
components, etc., we do not put the signs back until the very
end of the operation.

(c) If we are to proceed as far as p components any
method of "breaking the centroid" from the origin will do in
the sense that we get p independent components. But they
will not in general have any optimal properties in regard to
variance.


2.22 It remains to justify the process of extracting the
successive components in centroid analysis. We require a result
similar to that of 2.11.

Suppose that, in some space of dimension p, we have a set
of m vectors $y_{ij}$ (i = 1, 2, ..., m; j = 1, 2, ..., p) (not
necessarily of unit length) and that their covariances are typified
by $e_{jk}$. If we project them perpendicularly to a unit vector $t_j$
and

$$a_i = \sum_{j=1}^{p} y_{ij} t_j \tag{2.43}$$

then the covariances of the projected vectors are given by

$$e'_{jk} = e_{jk} - a_j a_k \tag{2.44}$$

where $a_j$ and $a_k$ are the quantities (2.43) for the jth and kth
vectors.

Let CX, CY be two vectors of lengths $d_j$, $d_k$ projected
perpendicularly to a line CD on to a plane through D and
meeting it in A, B respectively. Then

$$AB^2 = AD^2 + DB^2 - 2\,AD \cdot DB \cos ADB$$

$$\phantom{AB^2} = AC^2 + BC^2 - 2\,AC \cdot BC \cos ACB$$

In conjunction with the relations AD = AC sin ACD, BD = BC
sin BCD this gives us

$$\sin ACD \, \sin BCD \, \cos ADB = \cos ACB - \cos ACD \, \cos BCD \tag{2.45}$$

But we also have by definition

$$d_j d_k \cos ACB = e_{jk} \quad \text{and} \quad AD \cdot DB \cos ADB = e'_{jk}$$

From (2.45) we then derive

$$e'_{jk} = e_{jk} - d_j \cos ACD \; d_k \cos BCD$$

and since $a_j = d_j \cos ACD$, $a_k = d_k \cos BCD$ this results
in (2.44).

To justify the process of successive extraction we need
to note (1) that the reflection of vectors does not affect
the dimensions of the space in which they lie and (2) that if
there are m vectors y lying in a space of dimension p with
covariances $e_{jk}$ and we form

$$a_j = \frac{\sum_k e_{jk}}{\sqrt{\sum_j \sum_k e_{jk}}} \tag{2.46}$$

then there is a unit vector $t_i$ lying in the p-space such that
cov $(y_j, t) = a_j$.

For if we form the dispersion matrix of the y's and t

$$\begin{bmatrix} e_{jk} & a_j \\ a_k & 1 \end{bmatrix} \tag{2.47}$$

from the definition of the a's in (2.46) we see that this
matrix has the same rank as $e_{jk}$, the last row being the sum
of the other rows divided by $\sqrt{\sum \sum e_{jk}}$. This proves that the first
centroid component lies in the p-space of variation. When
we extract it we project the vectors on to a plane orthogonal
to this component. Any subsequent components are therefore
uncorrelated with this component. The residual covariance
matrix by (2.44) is the residual as we have calculated it.
Reflection moves the centroid away from the origin and so we
can proceed to the analysis of subsequent components.
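
The projection result (2.44) is easy to confirm numerically. The sketch below assumes that "covariance" of two vectors is taken as their inner product, and uses three hypothetical vectors in a three-dimensional space with an arbitrary unit vector t; none of these numbers come from the notes.

```python
def dot(u, v):
    """Inner product of two co-ordinate lists."""
    return sum(a * b for a, b in zip(u, v))


def project_out(vectors, t):
    """Remove from each vector its component along the unit vector t."""
    return [[y_i - dot(y, t) * t_i for y_i, t_i in zip(y, t)]
            for y in vectors]


# Hypothetical vectors (rows) and a unit vector t, chosen for illustration.
vectors = [[1.0, 2.0, 0.5], [0.3, -1.0, 2.0], [2.0, 0.0, 1.0]]
t = [2.0 / 3.0, 1.0 / 3.0, 2.0 / 3.0]

a = [dot(y, t) for y in vectors]          # the quantities of (2.43)
projected = project_out(vectors, t)

# Largest discrepancy between e'_jk and e_jk - a_j a_k over all pairs.
worst = max(
    abs(dot(projected[j], projected[k])
        - (dot(vectors[j], vectors[k]) - a[j] * a[k]))
    for j in range(3) for k in range(3)
)
print(worst)
```

The discrepancy should be zero to rounding error, and each projected vector is orthogonal to t.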
A ranking approximation to the first component


2.23 If our object is to arrange the individuals in some sort
of order according to a principal component (as, for example,
if we wish to arrange students in order of a putative general
intelligence as shown by test performances) a very fair
approximation can often be obtained by simple ranking methods.
For any variate we rank the n individuals from 1 to n. We
then add the p ranks for each individual and so arrive at a
score for him. The ordering of this score will give us a
ranking of the individuals according to a common component.

Such a procedure, in fact, maximizes the average Spearman
correlation between the ranking so reached and the rankings
according to the p variates (Kendall, 1955, Rank Correlation
Methods, 7.10). The rank-vector therefore gets as close to
the p rank-vectors as it can, so to speak, and will
approximate to the first centroid component and to the first
principal component.
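
The ranking procedure of 2.23 may be sketched in a few lines. The data below are hypothetical scores of five individuals on three tests, invented purely for illustration, and the sketch assumes there are no tied values within a variate.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def rank_sum_scores(data):
    """data[j][i] is the value of variate j for individual i.
    Returns, for each individual, the sum of his p ranks."""
    per_variate = [ranks(column) for column in data]
    n = len(data[0])
    return [sum(r[i] for r in per_variate) for i in range(n)]


# Hypothetical data: three variates (rows) observed on five individuals.
data = [
    [12.0, 15.5, 9.0, 20.1, 17.3],
    [30.0, 41.0, 28.5, 39.0, 44.0],
    [5.1, 6.0, 4.2, 6.9, 7.7],
]
scores = rank_sum_scores(data)
final_ranking = ranks(scores)
print(scores, final_ranking)
```

Ranking the summed scores then gives the ordering of the individuals on the common component.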


Example 2.7 


In the case referred to in Example 2.3 Kendall ranked the
48 counties according to the yields for each of the 10 crops,
summed the ranks and compared the order with that given by
the first principal component. The agreement was strikingly
close and a Spearman $\rho$ between the orders given by the two
methods was 0.99.


Stamp (1952, Land for Tomorrow, Bloomington Press, Indiana)
has recently applied this idea to an agricultural grading of
certain countries. The average ranks on nine crops (wheat,
rye, barley, oats, corn, potatoes, sugar beet, beans and peas)
were as follows

          1934-8                      1946
    Belgium        2.2          Belgium        2.3
    Denmark        2.6          Denmark        2.4
    Netherlands    2.9          Netherlands    2.4
    Germany        4.3          New Zealand    4.2
    Britain        4.7          Britain        4.8
    Ireland        4.7          Ireland        5.3
    New Zealand    5.8          Egypt          6.2
    Egypt          6.3          Germany        7.6
    Austria        7.2          U.S.A.         8.2
    France         9.2          France         9.0
    Japan         10.4          Canada         9.1
    Italy         12.0          Austria       11.2
    U.S.A.        12.0          Chile         11.5
    Canada        12.3          Argentina     12.4
    Spain         12.6          China         12.7
    Chile         12.9          Italy         12.7
    China         13.6          Japan         14.1
    Argentina     14.3          Spain         14.2
    Australia     16.0          India         17.0
    India         17.8          Australia     17.2


With all its imperfections, this is an interesting table. 
The lowering of the positions occupied by countries which are 
supposed to have lost the war is noticeable; and the rise in 
countries remote from the actual battlefields, New Zealand, 
U.S.A., Canada and Argentina (but not Australia or India) is 
also brought out. 




3. FACTOR ANALYSIS 


3.1 In component analysis we begin with the observations and 
look for components in the hope that we may be able to reduce 
the dimensions of variation and also that our components may, 
in some cases, be given a physical meaning. We work from the 
data toward a hypothetical model. In factor analysis we work 
the other way round; that is to say, we begin with a model 
and require to see whether it agrees with the data and, if so, to 


estimate its parameters. 


3.2 This is a broad distinction but it is often blurred in
practice because, at different stages of the development of a
subject, we may be doing both of these operations. For example
Spearman, working on material given by psychological tests,
was led to formulate the model of a single g-factor which he
identified with intelligence. This was working from data to
model. But once the model was given it was compared with
further data. Some of these could not be made to fit and the
model was modified to include further factors. This model
was now compared with further observation, and so on. Most
science progresses in this way and we shall find in the examples
considered below that it is not always easy to classify them
into component or factor analyses. This need not worry us.
They are often both at different stages of the cycle from
experiment to hypothesis and back again.


3.3 We will suppose, as before, that we have a matrix of
observations $x_{ij}$, and we wish to consider whether they can arise
from a situation with a structure given by

$$x_i = \sum_{k=1}^{p} a_{ik} f_k + b_i s_i + c_i \varepsilon_i , \qquad i = 1, \ldots, p \tag{3.1}$$

To avoid confusion we omit the suffix j relating to the
particular observation on the ith variate. Here the $f_k$ are
factors which may appear in more than one x, $s_i$ is a factor
specific to $x_i$ and $\varepsilon_i$ is an error term. The f, s, and $\varepsilon$ are
regarded as having unit variance and assumed to be all
independent inter se.


With this degree of generality the model is underdetermined.
In fact, by a component analysis we can always express
the x's in terms of f's without invoking specific or
error terms at all. But we shall often consider the case
where there are only one, two or three f's, which is equivalent
to setting up a model where a number of the coefficients
a are zero. In such a case any residual variation may be
ascribed to specifics or error terms.


3.4 We shall also find it necessary to draw some distinctions 
between types of model. 


Let us consider first of all the case where the given x's
comprise all the data, that is to say we are not regarding
them as a sample from some larger population.

(a) We can take a model with p components in which case
no specifics or errors are involved;

(b) we can take m factors (m < p) and consider the
residual as a simple error term

$$x_i = \sum_{k=1}^{m} a_{ik} f_k + \varepsilon_i \tag{3.2}$$

In this form $\varepsilon_i$ does not necessarily have unit
variance. If we wish it to do so we have to insert
a parameter $c_i$ to have

$$x_i = \sum_{k=1}^{m} a_{ik} f_k + c_i \varepsilon_i \tag{3.3}$$

Now "error" here may have two meanings. It may refer
to errors of observation in the x's or it may be a
synoptic way of saying that we have run together
parts of the model which ought to be kept distinct
(or a mixture of both). If we wish to attribute
any residual variation to $x_i$ it makes no difference
in this model whether we call it $\varepsilon$ or s. The only
distinction arising is between the situation where
residuals are particular to the x's and where they
may be entangled because they comprise factors which
should have been separately included in the model.

To illustrate the distinction, suppose that a number
of persons are subjected to tests of some kind in
all of which power of comprehension, $f_1$, and speed,
$f_2$, affect performance. The ith test $x_i$ may then
be regarded as a weighted combination of $f_1$ and $f_2$:

$$x_i = a_{i1} f_1 + a_{i2} f_2$$

But if these tests differ in nature, they may have
certain individual elements; for example, if the
ith test relates to arithmetic there may be an
"arithmetic ability" which also affects performance,
giving

$$x_i = a_{i1} f_1 + a_{i2} f_2 + b_i s_i$$

and the score on this test by the jth individual is

$$x_{ij} = a_{i1} f_{1j} + a_{i2} f_{2j} + b_i s_{ij}$$

The same individual may not always score exactly
the same on the same test (if we can give it to him
on more than one occasion); or the score may be
subject to observational error. We then have

$$x_{ij} = a_{i1} f_{1j} + a_{i2} f_{2j} + b_i s_{ij} + c_i \varepsilon_{ij}$$


3.5 Further distinctions arise when we regard the x's as a
sample.

(c) Taking first of all the principal component model

$$x_i = \sum_{k=1}^{p} \alpha_{ik} \phi_k + \varepsilon_i \tag{3.4}$$

we may regard the $\phi$'s as fixed variates, and the $\varepsilon$'s
as errors due to sampling expressing the deviation of
the particular observed x's from the population values.
The problem is then to estimate the $\alpha$'s and the $\phi$'s
considered as parameters of the population.

(d) The same may be true if we consider m < p factors

$$x_i = \sum_{k=1}^{m} \alpha_{ik} \phi_k + \varepsilon_i \tag{3.5}$$

(e) But if we wish to consider specific factors we have
a model

$$x_i = \sum_{k=1}^{m} \alpha_{ik} \phi_k + \beta_i \theta_i + \gamma_i \varepsilon_i \tag{3.6}$$

in which $\varepsilon$ is an error of observation in the x's.
Apart from this we regard any member $x_{ij}$ as an
observation on a variate which is composed of two
parts, a linear sum of $\phi$'s and a specific $\theta$. We
regard $\theta$ as distributed over the population in a
particular way (usually normally with zero mean and
unit variance). But we may also regard $\phi$ in the
same way. In such a case our object is to estimate
$\alpha$, $\beta$ and $\gamma$ but not $\phi$.


3.6 These distinctions are subtle but important. The student
who comes to them for the first time can pass over them for the
time being, merely noting their existence. The importance of
the distinctions will probably be clear after we have considered
some examples.


3.7 Let us revert to the model of (3.1). The f's are called
common factors. If an f occurs in all x's it is called a
general factor. If it occurs only in certain x's it is
called a group factor.

The factor $s_i$ is said to be specific to the variate $x_i$.
The element $\varepsilon_i$ is called the unreliability factor.

On our assumption concerning the independence and unit
variance of the factors we have

$$\text{cov}(x_i, x_j) = \sum_k a_{ik} a_{jk} + b_i b_j \,\text{cov}(s_i, s_j) + c_i c_j \,\text{cov}(\varepsilon_i, \varepsilon_j)$$

and if we also assume that the specifics and unreliability
factors are independent this reduces to

$$\text{cov}(x_i, x_j) = \sum_k a_{ik} a_{jk}, \qquad i \neq j \tag{3.7}$$

$$\text{var}\, x_i = \sum_k a_{ik}^2 + b_i^2 + c_i^2 \tag{3.8}$$

Thus the factors b and c appear only in the variances and do
not affect the covariances. If we now standardize so that
the variances are unity we have

$$1 = \sum_k a_{ik}^2 + b_i^2 + c_i^2 \tag{3.9}$$

The quantity $c_i^2$ is called in psychological terminology
unreliability. The statistician would more often call it an
error variance. The quantity $b_i^2$ is called the specificity.
The quantity $\sum_k a_{ik}^2$ is called the communality and is usually
written $h_i^2$. The complement $1 - h_i^2$ is called the uniqueness.
$h_i^2 + b_i^2$ is called the reliability.


3.8 In many types of scientific inquiry the unreliability 
would be determined from replicated experiments as an error 
variance. This is undoubtedly the best course when it can 
be followed. But in the social sciences replication may be 
impossible; and in psychology it is often difficult owing to 
the memory factor. For this reason a good deal of attention
has been given, especially in psychology, to the adjustment 
or estimation (or guessing) of the communality element. 


3.9 Reverting to the point of view we adopted in Section 1,
suppose we have n observations on p variables $x_i$. From the
principal component viewpoint we have

$$S = \sum_i \text{var}\, \xi_i = \sum_i \sum_j \sum_k l_{ij} l_{ik} \,\text{cov}(x_j, x_k)$$

and the principal component equations are

$$\sum_j l_{ij} \,\text{cov}(x_j, x_k) = \lambda_i l_{ik} \tag{3.10}$$

These are in terms of covariances. If we now standardize by
dividing them by $\{\text{var}\, x_j \,\text{var}\, x_k\}^{1/2}$ we do not get

$$\sum_j l_{ij} r_{jk} = \lambda_i l_{ik} \tag{3.11}$$

In fact, we only get this result if we standardize at the
outset. This is another aspect of the point which we mentioned
in 2.7 (d).

From the factor viewpoint, if

$$x_{ij} = \sum_k a_{ik} f_{kj}$$

then

$$\text{cov}(x_i, x_l) = \frac{1}{n} \sum_j x_{ij} x_{lj} = \frac{1}{n} \sum_j \sum_k \sum_m a_{ik} a_{lm} f_{kj} f_{mj} \tag{3.12}$$

and if we replace $\sum_j f_{kj} f_{mj} / n$ by its expected value
$\delta_{km}$ (= 1 if k = m, = 0 if k ≠ m) we have

$$\text{cov}(x_i, x_l) = \sum_k a_{ik} a_{lk} \tag{3.13}$$

If the model is

$$x_{ij} = \sum_k a_{ik} f_{kj} + \varepsilon_{ij} \tag{3.14}$$

we have

$$\text{cov}(x_i, x_l) = \sum_k a_{ik} a_{lk} + \text{var}\, \varepsilon_i \, \delta_{il} \tag{3.15}$$

Now we require to estimate the coefficients $a_{ik}$ and if we
could operate on a matrix $| \sum_k a_{ik} a_{lk} |$ our ordinary methods
would apply. But we have only the estimated matrix
$| \sum_k a_{ik} a_{lk} + \delta_{il} \,\text{var}\, \varepsilon_i |$. This is the same as the former
matrix except for the principal diagonals, where each term is
increased by var $\varepsilon_i$. Thus we would like to have in the main
diagonal, not $\sum_k a_{ik}^2 + \text{var}\, \varepsilon_i$ but only $\sum_k a_{ik}^2$; that is to say, if
we are not to bias the estimates of the a's we must remove
var $\varepsilon$ from the diagonal terms. This is equivalent to
substituting communalities for unity in the diagonals of the
standardized matrix.

Likewise, if we apply principal component analysis to
$| \sum_k a_{ik} a_{lk} + \delta_{il} \,\text{var}\, \varepsilon_i |$ we get biassed results and need
to remove the component var $\varepsilon$ from the diagonals before
carrying out the analysis.


The treatment of communalities 


3.10 Where the $h_i^2$ are completely unknown one method of approach
has been to regard them as being at choice; and in particular
to assume that they are such as to minimize the number of
factors. In general this seems to assume on Nature's part a
much more indulgent behaviour than we have any right to expect,
but it is interesting to see what happens in such cases.

The correlation matrix with $h_i^2$ instead of unity in the
main diagonals has then p quantities at choice. The number
of independent conditions for it to have rank m is (from 2.4)

$$\tfrac{1}{2} (p - m)(p - m + 1)$$

Thus we can reduce a matrix of rank p to one of rank m if

$$p \geq \tfrac{1}{2} (p - m)(p - m + 1) \tag{3.16}$$

or

$$p^2 - p(2m + 1) + m(m - 1) \leq 0$$

leading to

$$2m + 1 - \sqrt{8m + 1} \;\leq\; 2p \;\leq\; 2m + 1 + \sqrt{8m + 1} \tag{3.17}$$

For example the following are some values of p and m

     p    2    3    4    5    6    8   10   15
     m    1    1    2    3    3    5    6   10

Thus for p = 2 and 3 we can always choose the communalities
so that only one component (general factor) is required; for
larger p a single factor suffices only if the
observed correlations are of a particular kind.
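
The small table of p and m follows directly from the counting condition (3.16). A minimal sketch, in which `min_factors` is a hypothetical helper that searches for the smallest admissible m:

```python
def min_factors(p):
    """Smallest m satisfying (p - m)(p - m + 1)/2 <= p, the condition (3.16).
    m = p always satisfies it, so the loop is certain to return."""
    for m in range(p + 1):
        if (p - m) * (p - m + 1) <= 2 * p:
            return m


table = {p: min_factors(p) for p in (2, 3, 4, 5, 6, 8, 10, 15)}
print(table)
```

The dictionary reproduces the values of m tabulated in the text.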


3.11 Spearman's theorem. If there is only one general factor
the correlations are given by

$$r_{ij} = a_i a_j$$

and hence

$$\frac{r_{ij}}{r_{ik}} = \frac{a_j}{a_k} = \text{same for all } i$$

Thus (apart from diagonal terms) the coefficients in any two
rows or columns of the correlation matrix are proportional.
To put it another way, for any different a, b, c, d

$$r_{ab} r_{cd} - r_{ad} r_{cb} = 0 \tag{3.18}$$

these being the so-called tetrad differences.

Spearman's theorem says that if there is one general factor
the tetrad differences vanish. This, as we have just
seen, is easy to prove. The converse says that if the tetrad
differences vanish there is only one factor. This is much
more difficult but under certain general conditions is true.
(See Camp, 1932, Biometrika, 24, 418.)
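
The vanishing of the tetrad differences under a single general factor can be illustrated numerically. In the sketch below the loadings are hypothetical, and the matrix of Example 2.1 is included only as a contrast; `tetrad_differences` is an illustrative helper that forms the three distinct differences for every set of four variates.

```python
from itertools import combinations


def tetrad_differences(r):
    """Return the three tetrad differences for every set of four variates."""
    out = []
    for a, b, c, d in combinations(range(len(r)), 4):
        out.append(r[a][b] * r[c][d] - r[a][c] * r[b][d])
        out.append(r[a][b] * r[c][d] - r[a][d] * r[b][c])
        out.append(r[a][c] * r[b][d] - r[a][d] * r[b][c])
    return out


# Hypothetical one-factor pattern: r_ij = a_i * a_j off the diagonal.
loadings = [0.9, 0.8, 0.7, 0.6, 0.5]
one_factor = [[1.0 if i == j else loadings[i] * loadings[j]
               for j in range(5)] for i in range(5)]
largest = max(abs(t) for t in tetrad_differences(one_factor))

# The matrix of Example 2.1, which needs more than one factor.
example_2_1 = [
    [1.00, 0.80, 0.60, 0.60],
    [0.80, 1.00, 0.96, 0.00],
    [0.60, 0.96, 1.00, -0.28],
    [0.60, 0.00, -0.28, 1.00],
]
tetrads_2_1 = tetrad_differences(example_2_1)
print(largest, [round(t, 3) for t in tetrads_2_1])
```

For the one-factor pattern every difference is zero to rounding error, whereas for the matrix of Example 2.1 none of the three is near zero.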
Estimating communalities


3.12 Psychologists give a number of recipes for guessing
communalities. One is to take the largest correlation in the
corresponding column of the correlation matrix; another is to
take the mean of the correlations in the corresponding column,
the so-called averoid method.

Suppose we have, by some means or other, guessed the
communalities. We then perform an analysis of the data and
arrive at certain factors, fewer than p, being content to
neglect the rest. We can then use the coefficients occurring
in these factors to estimate new communalities, iterate and
proceed until the communalities converge. What this process
amounts to is that we assume m factors and assume that they
account for as much as possible of the variance; this
determines the communalities and consequently the "error"
variances. But we have not, by a mathematical manoeuvre plus
assumption, estimated the error variances such as might arise
in practice. We have only estimated what they would be if the
number of factors is what we think it is and the error variances
are minimal.
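
The mechanics of the iteration described in 3.12 can be sketched for the simplest case of a single factor. Everything in the block is an assumption made for illustration: the correlation matrix is a hypothetical one-factor pattern, the factor is extracted by repeated multiplication rather than by any method prescribed in the notes, and the function names are invented. No claim is made about how quickly, or whether, the process settles down from arbitrary starting guesses.

```python
import math


def first_factor_loadings(m, sweeps=200):
    """Loadings of the first principal factor of the symmetric matrix m,
    found by repeated multiplication (power iteration)."""
    n = len(m)
    v = [1.0] * n
    for _ in range(sweeps):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    root = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
    return [math.sqrt(root) * x for x in v]


def iterate_communalities(r, h2, rounds=50):
    """Put h2 in the diagonal of r, extract one factor, re-estimate h2
    from the squared loadings, and repeat."""
    n = len(r)
    for _ in range(rounds):
        reduced = [[h2[i] if i == j else r[i][j] for j in range(n)]
                   for i in range(n)]
        loadings = first_factor_loadings(reduced)
        h2 = [a * a for a in loadings]
    return h2


# Hypothetical one-factor pattern: r_ij = a_i * a_j off the diagonal.
a = [0.9, 0.8, 0.7, 0.6, 0.5]
r = [[1.0 if i == j else a[i] * a[j] for j in range(5)] for i in range(5)]

# Starting from the exact communalities a_i^2 the process stays put.
exact = [x * x for x in a]
unchanged = iterate_communalities(r, exact, rounds=3)

# Starting from the largest correlation in each column, as in the first recipe.
start = [max(r[i][j] for i in range(5) if i != j) for j in range(5)]
estimated = iterate_communalities(r, start)
print([round(h, 4) for h in unchanged], [round(h, 4) for h in estimated])
```

The check that matters here is the first one: the true communalities are a fixed point of the process, which is the sense in which the iteration "determines" them once the number of factors is assumed.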


Example 3.1 


(Holzinger and Harman, 1941, Factor Analysis, Chicago
University Press, using data by Gosnell and Schmidt, 1936, J.
Am. Statist. Ass., 31, 507.)

Chicago was divided into 147 election areas in the 1934
Presidential elections (n = 147). For these areas Holzinger
and Harman pick out 8 variates (p = 8) from 17 considered
by Gosnell and Schmidt, as follows

(1) % Democratic vote and Republican vote for Lewis (a
Democratic candidate).

(2) Ditto for Roosevelt (also a Democratic candidate).

(3) Party vote: % that straight party votes bore to
total votes.

(4) Median rental in dollars.

(5) Home ownership: % of total families owning their
own houses.

(6) Unemployment: % unemployed in 1931 of gainfully
employed workers over nine years in age.

(7) Mobility: % of total families living more than one
year at their present address.

(8) Education: % of population over 17 years of age who
completed at least ten grades at school.

Holzinger and Harman assumed a two factor pattern for this
complex, estimated communalities by averaging correlations and
did a centroid analysis to obtain the following coefficients

    Variate    First factor    Second factor
       1           .69             -.28
       2           .88             -.48
       3           .87             -.17
       4          -.88             -.09
       5           .28              .65
       6           .89              .01
       7          -.66             -.56
       8          -.96             -.15

The first factor was found to contribute about 62% of the
variance and the second about 14%, making about 76% for the
two.
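
The percentages quoted can be recovered from the table of coefficients, since with standardized variates the contribution of a factor is the mean of its squared coefficients, and the communality of a variate is the sum of the squares in its row. A minimal sketch using the two-decimal figures of the table (the variable names are illustrative):

```python
# (first factor, second factor) coefficients for variates 1 to 8.
loadings = [
    (0.69, -0.28),
    (0.88, -0.48),
    (0.87, -0.17),
    (-0.88, -0.09),
    (0.28, 0.65),
    (0.89, 0.01),
    (-0.66, -0.56),
    (-0.96, -0.15),
]
p = len(loadings)

communalities = [a * a + b * b for a, b in loadings]
uniquenesses = [1.0 - h2 for h2 in communalities]

share_first = sum(a * a for a, _ in loadings) / p
share_second = sum(b * b for _, b in loadings) / p

print([round(h2, 3) for h2 in communalities])
print(round(share_first, 3), round(share_second, 3))
```

With coefficients rounded to two places the shares come out close to the 62% and 14% of the text, and the communality of variate 2 falls marginally above unity, which is presumably an effect of the rounding.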


The absolute values of these coefficients are not of
primary importance; we are more concerned with their signs and
their relative magnitudes. We note that the first factor
appears strongly in variates 1, 2, 3 and 6 in a positive way
and negatively in variates 4, 7 and 8. Those areas possessing
it had more Democrats, greater unemployment, greater mobility
and less education than the Republican complement. Perhaps
we should emphasize that this does not prove that Democratic
voting is caused by the other factors. All we are doing is
to explain the variation in terms of two "factors", the first
being apparently a compound of the features we have listed.

The second factor is much less important but seems
significant. The greatest coefficients are those for
home-ownership (positive) and votes for Roosevelt and immobility
(negative). This may be even more tentatively identified
with a "home-permanency" factor.

I must leave it to American citizens to judge whether
these results are reasonable. In any case the methodology
is interesting and seems capable of development.


Example 3.2 
(Rhodes, 1937, J. R. Statist. Soc., 100, 18.) 


Rhodes took monthly series for the 48 months July 1931 to
June 1935 of 13 series contributing to the "Economist" Index
of Business Activity: e.g. Employment, Coal Consumption,
Merchandise on Railways, Postal Receipts, Imports of Raw
Materials, Bank Clearings. He assumed a "general business
activity" factor and considered other effects as residual,
giving him, in our notation

$$x_{ij} = a_i f_j + b_{ij} \tag{3.19}$$

He determined the a's by a least-squares solution of

$$\text{cov}(x_i, x_j) = a_i a_j$$

after taking logarithms to give

$$\log a_i + \log a_j = \log \text{cov}(x_i, x_j). \tag{3.20}$$

The results were not very unlike those given by the
"Economist" Index. Note that

(a) successive values of the series are correlated so
that the observations are not independent;

(b) the least-squares solution is therefore somewhat
heuristic;

(c) Rhodes calculated the residuals after extraction of
the first factor and came to the conclusion that
they were not independent. This led him to suggest
the existence of group factors.


Rather oddly, Rhodes’ pioneer work in this field does not 
seem to have been followed up. This may be due to the fact 
that economists disclaim a knowledge of multivariate methods 
(at least of this kind) and statisticians are nervous about 
the serial correlation effect in component analysis. The sub- 
ject would probably repay further study. 


Example 3.3 
(Burt and Banks, 1941, Ann. Eugen. Lond., 13, 238) 
2,400 male volunteers for the Royal Air Force were divided 
into 8 age groups. The total range of age was 17 — 38 and the 


results for the 8 groups were so similar that consolidated 
figures could be presented and are discussed here. 


An analysis was carried out on "types" of body build. 
The nine variates specified below were available. 


f1 f2 f3
1 Standing height + — z-
2 Sitting height ors = En
3 Arm length iA = F 
4 Leg length E = os 
5 Thigh length + = oy 
6 Abdomen girth + t sy 
7 Hip girth + ay = 
8 Shoulder girth + + = 
9 Weight t ar = 


Six factors were extracted by the centroid method. The 
first three accounted for 55%, 13.5% and 10.1% (78.6% in all) 
of the variance. The remainder were small and of very
doubtful significance.


The foregoing table gives the signs of the coefficients
of the factors; we have not bothered to write down the actual
values. The main factor is positively associated with every
measurement. This was identified as a general "size" factor.
(This might mean that there was some quality of an individual
which determined the size of his body or that there are qualities
which preserve some rough sort of proportionality in different
individuals.)


Factor 2 has negative signs for measurements on the length 
of the body and positive signs for those on girth and on the 
weight. This was held to support the dichotomy, proposed by 
some anthropometrists, into two types of body build, the lep- 
tosomic or lean type and the pyknic or thickset type. If 
this factor has any reality (e.g. if it could be identified in 
a gene) it would imply the existence of some effect which might 
vary from one end of a range, involving a very spare lean in- 
dividual, to the other end at which the individual was very 
thickset — irrespective of the actual size, which is determined 


by the first factor. 


Factor 3 has positive coefficients for measurements above 
the waist and negative coefficients for those below the waist. 
This suggests a difference between trunk length and leg length. 


The model conjured up by the analysis is one for which the
measurements of the individual are determined by three
independent factors. The first determines how big the individual
is to be; the second decides whether he is to be leptosomic
or pyknic or somewhere between the two; the third differentiates
his trunk and leg lengths. Whether this is a plausible model
must be left for the anthropometrist or the geneticist. The
analysis has shown that about 80% of the variation can be
accounted for by three independent components, but it is fair to
question whether this is more than a convenient mathematical
representation.


Example 3.4
(Harper and others, 1950, Brit. J. App. Phys. 1, 1)


A number of measurements were made on a set of plastics 
with the following results : 


f1 f2 f3

A. Tensile strength + .86 ESTO ROZ 
B. 4 elongation at break a E + .03 + 310 
C. % elongation under deadload 

in 5 minutes + ,92 + .14 + 430 
D. % recovered of C in 5 minutes aun! cite!) + .08 + .11 
E. B.S. Hardness number + .92 + .27 + .29 
F. % indentation in 24 hours by 

a loaded wire + .81 + 616 + 42 
A. Minus 2° to which lowered before 

breaking in a standard way ws Al = Hey otal 
H. log volume of resistivity + .69 toner?) — .36 
I. Dieletric constant + .68 +.19  — 4.50 
J. Power factor Bry pale? = .58 + .41 


The analysis was done by an estimation of communalities and a 
centroid method. 


Three factors were extracted, accounting respectively for
52.2, 8.7 and 10.9% of the variance (totalling 71.8%). Note
that the second accounts for less than the third. This can
happen in a centroid analysis, but not, of course, in a
principal component analysis.


The interpretation of these results is a matter of
difficulty. With the exception of the first factor there seems
nothing to suggest reality. The first factor itself seems to
have something to do with molecular cohesion.

This paper has one point of theoretical interest, namely
that the estimated communalities can be judged in the light of
what is known about the experimental error.


Example 3.5 


(Buckatzsch, 1947, Population Studies, 1, 229) 


Buckatzsch's study concerned the influence of social
conditions on mortality rates and the latter were taken as
dependent variables. From his results, however, we can consider
the possibility of expressing the independent variables in
terms of principal components.


For 81 County Boroughs (= large towns and cities) in
England and Wales in the 1931 Census the following were
ascertained:

                                              Factor 1   Factor 2
1. % families living in one room                  +          -
2. % males unemployed                             +          -
3. % of male population in working classes        +          +
4. % women engaged in factory occupations         +          +
5. Latitude N of 50° 30'                          +          +

The signs of coefficients are shown for two factors extracted,
which accounted for about 80% of the variance. (Buckatzsch
estimated communalities and proceeded by principal components.)


The results are very tentative. The first factor is
identified with "social conditions". The second, so far as
it means anything, seems associated with women in factories,
the coefficient of this variate in the second factor
being much larger than for the others.


4. FUNCTIONAL RELATIONSHIP 


4.1 We have referred in the previous chapter to the importance 
of distinguishing between various kinds of model which are apt 
to become confused in factor analysis. It is equally import- 
ant to distinguish certain types of model in other multivariate 
situations. 


The source of a good deal of the trouble lies in the
over-facile way in which a statistician gives the name "error" to
any discrepancy between model and observation and then is misled
by his own terminology to postulate stochastic behaviour in the
"error" term. An extreme example may make the point clear.


4.2 Suppose a physicist begins with the intention of inves- 
tigating the behaviour of a gas under changes in pressure and 


volume. By a few primitive experiments he is led to expect a 
relation of the Boyle type 


\[ \log P + \log V - \log K = 0 \qquad (4.1) \]


where P is the pressure and V the volume; but he forgets to 
take changes in temperature into account. He may then wish 
to conduct more exact experiments to verify the law more close- 
ly and takes a number of readings of the variables P and V over 
a period of time. If he plots the variables on a logarithmic 
scale he will get points lying nearly on a straight line, and 
the best line he can draw will give him an estimate of the con- 
stant K. The problem arises because the points do not lie 
exactly on a straight line. 


Such a situation, simple as it is, requires at least three
different techniques of analysis according to the way we approach
it.

(a) We may regard (4.1) as the true underlying law of
    behaviour and conduct our experiment so that P is
    given what values we choose without error. Any
    discrepancy between observation and hypothesis is
    then due to errors in log V. These are errors of
    observation and the model is (for observed V, say v)

    \[ \log v = -\log P + \log K + \epsilon. \qquad (4.2) \]

    We now suppose that $\epsilon$ is a random variable or
    variate and have a simple regression model. Conversely
    we might arrange the experiment so that V
    was determined without error and p observed, in
    which case we should have a different regression
    model

    \[ \log p = -\log V + \log K + \epsilon. \qquad (4.3) \]

    It is not obvious that the same methods of estimation
    applied to (4.2) and (4.3) would lead to the same
    estimate of K.

(b) As a second model we may regard (4.1) as the true
    law of behaviour but contemplate errors of observation
    in both variables. The model now is that
    we observe p, v, given by

    \[ p = P + \epsilon, \qquad v = V + \eta \qquad (4.4) \]

    and P, V are related by (4.1). We now have a
    completely different situation.

(c) Again, we may suspect that (4.1) is not the right
    model and that we have left out a part of the true
    model (as in this case we have, namely the temperature
    T). We do not quite know what we have omitted,
    but we hope that it is not very important. We can
    now suppose, if we like, that P and V are observed
    without error, but we shall have for the model

    \[ \log P + \log V - \log K = \zeta \qquad (4.5) \]

    where $\zeta$ stands for "something-left-out-which-we-piously-
    hope-is-not-very-important". If we like, we can assume
    that this behaves like a random variable, in which case
    the model (4.5) becomes formally equivalent to either
    (4.2) or (4.3), whichever we prefer. But obviously
    we are making an enormous assumption here, very unlike
    the more customary assumption concerning errors of
    observation, which we know on empirical grounds to behave
    stochastically. The statistician makes this assumption
    very often, more often perhaps than he realises.

4.3 I assume that the reader is already acquainted with the 
standard regression theory required to deal with situations 
of type (4.2) where there are several independent variables. 
(For a more detailed discussion see the expository articles 
on "Regression, Structure and Functional Relationship", Bio- 
metrika, 1951, 38, 11 and 1952, 39, 96). In this chapter we 
shall be concerned with models of the second kind where both 
variables are subject to errors of observation. 


4.4 We consider the case where two variables U and V are re- 
lated by a linear function 


\[ V = \alpha + \beta U. \qquad (4.6) \]

These variables are not directly observed. In fact if d and
e are random variables we observe

\[ x' = U + d \qquad (4.7) \]
\[ y' = V + e \qquad (4.8) \]

Throughout we take d and e to be independent.

Substituting in (4.6) we have

\[ y' = V + e = \alpha + \beta U + e = \alpha + \beta(x' - d) + e = \alpha + \beta x' - \beta d + e. \qquad (4.9) \]

Now for this to be treated as a regression of y' on x' we
should require the "residual" $(-\beta d + e)$ to be distributed
independently of x'. But from (4.7) x' and d are not
independent and ordinary regression theory does not apply.
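
A short simulation shows what goes wrong when (4.9) is nevertheless treated as a regression. It is a sketch under assumptions that are not in the text: U is drawn as a normal variate purely for convenience, and the parameter values and the helper `ols_slope` are hypothetical.

```python
import numpy as np


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


rng = np.random.default_rng(0)
n = 200_000
alpha, beta = 2.0, 1.5                 # hypothetical true constants
U = rng.normal(0.0, 1.0, n)            # true values, variance 1
d = rng.normal(0.0, 0.5, n)            # error in x', variance 0.25
e = rng.normal(0.0, 0.3, n)            # error in y'
x = U + d                              # (4.7)
y = alpha + beta * U + e               # (4.6) with (4.8)

slope = ols_slope(x, y)
# Because x' and d are correlated, the slope tends to
# beta * var(U) / (var(U) + var(d)) rather than to beta.
expected = beta * 1.0 / (1.0 + 0.25)
```

The fitted slope settles near 1.2 instead of 1.5, which is the failure of ordinary regression theory referred to above.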


4.5 We shall discuss four cases:

(1) Berkson's case: the residual $(-\beta d + e)$ is
    independent of x'.

(2) Geary's case: U is a random variable but is not
    normal.

(3) The classical case 1: U is not a random variable.

(4) The classical case 2: U is a random normal variable.

It appears that the problem of estimating $\alpha$ and $\beta$ is
soluble in cases (1) and (2) but not in (3) and (4) without
some further restriction among the conditions of the problem.


Berkson's case 


4.6 It is convenient to consider Berkson's case first because
it can be referred back to the regression model. Beginning
effectively from (4.9) Berkson considers a situation where d
is independent of x' by introducing the idea of a controlled
variable. (If x' is a 'fixed' variable we mean that d is
mathematically independent; if x' is a variate d is
statistically independent.)

Consider an experiment in which we are testing Boyle's law
by adjusting pressures and then measuring the corresponding
volumes. We decide, shall we say, to adjust the pressure to
28 lbs per square inch and do so as far as our apparatus
allows. We shall commit an error in calling
this 28 lbs per square inch but we ignore this error and
proceed as if 28 is the correct value. We require to assume
only that positive and negative errors are equally likely.
This curious and subtle manoeuvre alters the model of (4.9).
In fact if we call this observed and controlled variable X the
'true' value, u, is a random variable connected with X by the
relation

\[ u = X + d \]

or better perhaps

\[ X = u - d. \qquad (4.10) \]

The linear relation (4.6), with (4.8) and (4.10) then becomes

\[ y' = \alpha + \beta(X + d) + e = \alpha + \beta X + \beta d + e. \qquad (4.11) \]

Now this is an ordinary regression model with a 'fixed' variable
X and a residual $\beta d + e$ which is independent of X. The
standard theory applies and we have unbiased estimators of $\alpha$
and $\beta$ given by

\[ a = \bar{y}' \qquad (4.12) \]
\[ b = \frac{\sum y' X}{\sum X^2} \qquad (4.13) \]

where X is measured about its mean. The usual t-test for
significance applies under assumptions of normality; we shall
see that this ceases to be true in non-linear cases.
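
The contrast with the uncontrolled case can be checked numerically. The sketch below is illustrative only: the seven control settings, the error scales and the parameter values are hypothetical, and X is taken symmetric about zero so that it is already measured about its mean.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 5.0, 0.8                       # hypothetical true constants
settings = np.linspace(-15.0, 15.0, 7)       # controlled values, mean zero
X = np.tile(settings, 5_000)                 # each setting used many times
d = rng.normal(0.0, 2.0, X.size)             # error in reaching the setting
e = rng.normal(0.0, 0.5, X.size)             # error of observation in y'
u = X + d                                    # true value actually attained
y = alpha + beta * u + e                     # observed response

a = y.mean()                                 # estimator (4.12)
b = (y @ X) / (X @ X)                        # estimator (4.13)
```

Although the attained value u differs from the nominal X at every reading, the residual $\beta d + e$ is independent of X, and a and b settle close to the true constants.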


4.7 Those who refer to Berkson's paper (1950, J. Am. Statist.
Assoc., 45, 164) should also consult Lindley (1953, Biometrika,
40, 47). Berkson was ostensibly discussing the question whether
there are two regression lines in a bivariate situation. There
certainly are, and Berkson's negative answer really relates to
the problem of functional relationship.


Non-linear Berkson case (Geary's extension) 


4.8 Geary (1953, J. Am. Statist. Ass. 48, 94) has extended 
Berkson's argument to the non-linear case. 


Suppose that X is observed at m points and, since the
scale is at choice, we arrange it so that odd moments of the
X's vanish, i.e.

\[ \sum_{i=1}^{m} X_i^{2k-1} = 0, \qquad k = 1, 2, \ldots \qquad (4.14) \]

Then if E relates to expectations over repetitions of the
experiment with the same X's we find, from (4.11)

\[ \sum X_i^{2k} E(y'_i) = \alpha \sum X_i^{2k} = m \alpha \mu_{2k}, \qquad k = 0, 1, \ldots \qquad (4.15) \]
\[ \sum X_i^{2k-1} E(y'_i) = \beta \sum X_i^{2k} = m \beta \mu_{2k}, \qquad k = 1, 2, \ldots \qquad (4.16) \]

The simplest solution is given by k = 0 in (4.15) and k = 1
in (4.16), leading to

\[ m\alpha = \sum E(y'_i) \qquad (4.17) \]
\[ m\beta\mu_2 = \sum X_i E(y'_i), \qquad (4.18) \]

and consistent estimators of $\alpha$ and $\beta$ are then

\[ a = \frac{1}{m} \sum y'_i \qquad (4.19) \]
\[ b = \frac{1}{m\mu_2} \sum X_i y'_i. \qquad (4.20) \]

These are also maximum likelihood estimators if the errors d
and e are normal and in any case (for zero means in d and e)
are Gauss-Markoff estimators. This confirms our previous
analysis of Berkson's case.


4.9 Consider now the case when the underlying relationship is
non-linear, say, cubic of the form

\[ v = \alpha + \beta u + \gamma u^2 + \delta u^3. \qquad (4.21) \]

We standardize as in (4.10) and then have

\[ y' = \alpha + \beta(X - d) + \gamma(X - d)^2 + \delta(X - d)^3 + e. \qquad (4.22) \]

Multiplying by 1, $X$, $X^2$, $X^3$ and summing over i = 1, ..., m
we have

\[ \nu_0 = \xi + \gamma\mu_2 \qquad (4.23) \]
\[ \nu_1 = \eta\mu_2 + \delta\mu_4 \qquad (4.24) \]
\[ \nu_2 = \xi\mu_2 + \gamma\mu_4 \qquad (4.25) \]
\[ \nu_3 = \eta\mu_4 + \delta\mu_6 \qquad (4.26) \]

where

\[ \nu_k = \frac{1}{m} E \sum X_i^k y'_i \qquad (4.27) \]
\[ \xi = \alpha + \gamma \operatorname{var} d \qquad (4.28) \]
\[ \eta = \beta + 3\delta \operatorname{var} d \qquad (4.29) \]

We then find for $\gamma$ and $\delta$

\[ (\mu_4 - \mu_2^2)\,\gamma = \nu_2 - \mu_2\nu_0 \qquad (4.30) \]
\[ (\mu_2\mu_6 - \mu_4^2)\,\delta = \mu_2\nu_3 - \mu_4\nu_1 \qquad (4.31) \]

This will give us consistent estimators for $\gamma$ and $\delta$ in terms
of the observables.

4.10 But there are two remarkable features about this
situation. The first is that, although we can estimate $\gamma$ and $\delta$,
we cannot estimate $\alpha$ and $\beta$. The quantities which occur in
equations like (4.23) - (4.26), however many of them we take,
are $\xi$ and $\eta$, which we can accordingly estimate. But from
(4.28) and (4.29) we see that we cannot estimate $\alpha$ and $\beta$
without a knowledge of the nuisance parameter var d. The only way
round this difficulty so far suggested (by Geary himself) is
to replicate the experiment. Since the whole analysis is
applicable mainly to experimental situations this is usually
possible.
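
The moment equations (4.23) - (4.31) can be tried out on simulated data. The following is a sketch under stated assumptions rather than Geary's own procedure: expectations are replaced by sample averages, the errors d are taken as normal (so that their third moment vanishes), and the function name, the grid of X values and all parameter values are hypothetical.

```python
import numpy as np


def cubic_moment_estimates(X, y):
    """Solve (4.23)-(4.26) with sample averages in place of expectations."""
    mu2, mu4, mu6 = (np.mean(X**k) for k in (2, 4, 6))
    nu0, nu1, nu2, nu3 = (np.mean(X**k * y) for k in range(4))
    gamma = (nu2 - mu2 * nu0) / (mu4 - mu2**2)                 # (4.30)
    delta = (mu2 * nu3 - mu4 * nu1) / (mu2 * mu6 - mu4**2)     # (4.31)
    xi = nu0 - gamma * mu2                                     # from (4.23)
    eta = (nu1 - delta * mu4) / mu2                            # from (4.24)
    return xi, eta, gamma, delta


rng = np.random.default_rng(6)
alpha, beta, gamma, delta = 1.0, 1.0, 0.5, 0.2   # hypothetical constants
var_d = 0.25
X = np.tile(np.linspace(-2.0, 2.0, 41), 5_000)   # symmetric: odd moments vanish
d = rng.normal(0.0, np.sqrt(var_d), X.size)
e = rng.normal(0.0, 0.2, X.size)
u = X - d                                        # as in (4.22)
y = alpha + beta * u + gamma * u**2 + delta * u**3 + e

xi_hat, eta_hat, gamma_hat, delta_hat = cubic_moment_estimates(X, y)
```

The estimates of $\gamma$ and $\delta$ come out close to the true values, while the other two quantities settle on $\xi$ and $\eta$ of (4.28) and (4.29), not on $\alpha$ and $\beta$; without a value for var d the latter cannot be disentangled.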


4.11 The second feature is that no ordinary test of
significance is applicable. In fact the "error variance" is no
longer that of $(-\beta d + e)$ but varies from one value of X to
another. The usual tests dependent on variance-ratios such as
"Student's" t do not apply. The problem of testing significance
in this case has not, apparently, been solved.


Geary's case 


4.12 In the case we now consider the random variables u and v
are connected by

\[ v = \alpha + \beta u \qquad (4.32) \]

and there are errors of observation in u and v:

\[ x' = u + d \qquad (4.33) \]
\[ y' = v + e \qquad (4.34) \]

We may attach a dummy variate to $\alpha$ to make the equation
homogeneous and write the more general form in p variates as

\[ \sum \beta_i u_i = 0 \qquad (4.35) \]

with

\[ x_i = u_i + d_i. \]

If parameter $\theta_i$ relates to $x_i$ we have that the cumulant
generating function of the $x_i$ is the sum of the c.g.f.'s of
$u_i$ and $d_i$ and on expansion

\[ \sum \kappa(c_1, \ldots, c_p) \frac{\theta_1^{c_1} \cdots \theta_p^{c_p}}{c_1! \cdots c_p!} = \text{the sum of similar expressions in the cumulants of } u_i \text{ and } d_i. \]

On the right the terms involving the cumulants of d do not
contain any product terms, for all the d's are independent,
and only powers of individual $\theta$'s can appear. Thus we have

\[ \kappa(c_1, \ldots, c_p) = \kappa^{(u)}(c_1, \ldots, c_p) \qquad (4.36) \]

so long as there are at least two c's involved. Thus we can
take cumulants calculated from observation as estimators of the
cumulants of the u's.

Now from (4.35) we see that the characteristic function of
$\sum \beta_i u_i$ is unity and thus

\[ 1 = \int \exp\left(\theta \sum \beta_j u_j\right) dF(u_1, \ldots, u_p) = \phi(\theta\beta_1, \theta\beta_2, \ldots, \theta\beta_p), \text{ say.} \qquad (4.38) \]

From the property of homogeneous functions we then have

\[ \sum \beta_i \frac{\partial \phi}{\partial \theta_i} = 0 \]

giving

\[ \sum \beta_i \, \kappa(c_1, c_2, \ldots, c_i + 1, \ldots, c_p) = 0. \qquad (4.39) \]

This gives us a set of linear equations in the $\beta$'s and the $\kappa$'s
which, in virtue of (4.38), are calculable from the
observations. Hence we may estimate the $\beta$'s.

Unfortunately, this method fails to give a result in the
case when we most need it, namely when the variation is normal;
for then cumulants of order higher than two all vanish and the
equations (4.39) are empty except for a few in the cumulants
$\kappa_{11}$ which are not enough to determine the $\beta$'s.


4.13 Moreover, in cases where the distribution of the u's is
nearly normal the values of $\kappa$'s observed will often have a
high sampling error compared with the true values, and estimates
of the $\beta$'s given by (4.39) may therefore be rather wide of the
mark. It is somewhat strange that normality, which we often
invoke to make a method work at all, should stultify this
particular method. There appears, however, to be some fundamental
connection between normality and indeterminacy in this and
related subjects which has not been fully brought to light.
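
For two variates the simplest member of the family (4.39) uses the third-order product cumulants, which gives $\beta$ as the ratio of the cumulant of type (x', y', y') to that of type (x', x', y'). The sketch below illustrates this on simulated data; it is an assumption-laden illustration and not Geary's worked example. The skew distribution chosen for u, the error scales and the function name `geary_slope` are all hypothetical.

```python
import numpy as np


def geary_slope(x, y):
    """Ratio of third-order product cumulants: k(x,y,y) / k(x,x,y)."""
    xc, yc = x - x.mean(), y - y.mean()
    k_xxy = np.mean(xc**2 * yc)   # third-order cumulants equal central moments
    k_xyy = np.mean(xc * yc**2)
    return k_xyy / k_xxy


rng = np.random.default_rng(7)
n = 400_000
alpha, beta = 1.0, 1.5
u = rng.exponential(1.0, n)              # markedly non-normal true values
x = u + rng.normal(0.0, 0.5, n)          # normal errors: no third cumulant
y = alpha + beta * u + rng.normal(0.0, 0.5, n)

beta_cumulant = geary_slope(x, y)

# For comparison, the ordinary regression slope of y' on x' is attenuated.
xc = x - x.mean()
beta_regression = float(xc @ (y - y.mean()) / (xc @ xc))
```

If u were replaced by a normal variate both third-order cumulants would tend to zero and the ratio would be dominated by sampling error, which is the indeterminacy discussed in 4.12 and 4.13.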


Classical case 1 
4.14 We now revert to the model

\[ V = \alpha + \beta U \]

where V, U are not random variables and each is subject to
error:

\[ x' = U + d \]
\[ y' = V + e. \]

If d, e are normal variates the likelihood is proportional to

\[ \frac{1}{\sigma_1^n \sigma_2^n} \exp\left\{ -\frac{1}{2\sigma_1^2} \sum (x'_i - U_i)^2 - \frac{1}{2\sigma_2^2} \sum (y'_i - \alpha - \beta U_i)^2 \right\} \]

where $\sigma_1^2$, $\sigma_2^2$ are the variances of d, e, and are supposed the
same for any observation. Thus, except for a constant,

\[ \log L = -n\log\sigma_1 - n\log\sigma_2 - \frac{1}{2\sigma_1^2}\sum(x'_i - U_i)^2 - \frac{1}{2\sigma_2^2}\sum(y'_i - \alpha - \beta U_i)^2. \qquad (4.40) \]

If we now maximise for the n + 4 parameters $\sigma_1$, $\sigma_2$, $\alpha$, $\beta$, $U_i$
we find

\[ \frac{\partial \log L}{\partial \sigma_1} = -\frac{n}{\sigma_1} + \frac{1}{\sigma_1^3}\sum(x'_i - U_i)^2 = 0 \]

giving

\[ \hat\sigma_1^2 = \frac{1}{n}\sum(x'_i - U_i)^2. \qquad (4.41) \]

Similarly

\[ \hat\sigma_2^2 = \frac{1}{n}\sum(y'_i - \alpha - \beta U_i)^2. \qquad (4.42) \]

Differentiating for $U_i$, $\alpha$, $\beta$ gives us

\[ \frac{1}{\sigma_1^2}(x'_i - U_i) + \frac{\beta}{\sigma_2^2}(y'_i - \alpha - \beta U_i) = 0 \qquad (4.43) \]
\[ \sum (y'_i - \alpha - \beta U_i) = 0 \qquad (4.44) \]
\[ \sum U_i (y'_i - \alpha - \beta U_i) = 0 \qquad (4.45) \]

Now summing (4.43) and using (4.44) we have

\[ \sum (x'_i - U_i) = 0. \qquad (4.46) \]

Also from (4.41) and (4.42), using (4.43), we find

\[ \beta^2 \hat\sigma_1^2 = \hat\sigma_2^2. \qquad (4.47) \]

This is unacceptable, for we have no reason to suppose that the
error variances of d and e are in the ratio 1 : $\beta^2$. The basic
reason for the breakdown of the maximum likelihood method seems
to be that we are trying to estimate too many parameters. There
are here n + 4 parameters against only n observations.


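4.15 It appears that no progress can be made in this direction
unless we make some assumption about the error variances. We
shall assume that their ratio is known:

\[ \sigma_1^2 = \lambda\sigma_2^2. \qquad (4.48) \]

The maximum likelihood equations are now

\[ \log L = \text{constant} - 2n\log\sigma_1 - \frac{1}{2\sigma_1^2}\sum(x'_i - U_i)^2 - \frac{\lambda}{2\sigma_1^2}\sum(y'_i - \alpha - \beta U_i)^2 \qquad (4.49) \]
\[ (x'_i - U_i) + \lambda\beta(y'_i - \alpha - \beta U_i) = 0 \qquad (4.50) \]
\[ \sum (y'_i - \alpha - \beta U_i) = 0 \qquad (4.51) \]
\[ \sum U_i (y'_i - \alpha - \beta U_i) = 0 \qquad (4.52) \]

Without loss of generality take $\bar{x}' = 0$, $\bar{y}' = 0$. Then from
(4.51)

\[ \alpha + \beta\bar{U} = 0 \]

and from (4.50)

\[ \bar{U} + \lambda\beta(\alpha + \beta\bar{U}) = 0. \]

Hence

\[ \bar{U} = 0 \qquad (4.53) \]
\[ \hat\alpha = 0 \qquad (4.54) \]

and for the estimator of $U_i$ from (4.50) we find

\[ U_i = \frac{x'_i + \lambda\beta y'_i}{1 + \lambda\beta^2} \qquad (4.55) \]

and hence

\[ \sum (y' - \beta x')(x' + \lambda\beta y') = 0, \]

i.e.

\[ \lambda\beta\sum y'^2 + (1 - \lambda\beta^2)\sum x'y' - \beta\sum x'^2 = 0. \qquad (4.56) \]

This gives us a formula for the estimation of $\beta$

\[ \lambda\beta^2\sum x'y' + \beta\left(\sum x'^2 - \lambda\sum y'^2\right) - \sum x'y' = 0. \qquad (4.57) \]

If

\[ \theta = \frac{\sum y'^2 - \frac{1}{\lambda}\sum x'^2}{2\sum x'y'} \qquad (4.58) \]

then

\[ \hat\beta = \theta \pm \sqrt{\theta^2 + \frac{1}{\lambda}} \qquad (4.59) \]

where we take the positive root to maximise the likelihood.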


4.16 We notice that if $\lambda = 0$ (corresponding to no error
in U) the estimator of $\beta$ from (4.56) is

\[ \hat\beta = \frac{\sum x'y'}{\sum x'^2} \]

and, in fact, we are back at a regression situation. Similarly,
if $\lambda$ is infinite we have a regression of U on V. For values
of $\lambda$ between 0 and infinity the line $V - \alpha - \beta U = 0$ lies
between these two extremes. In particular if $\lambda = 1$ the
line takes the direction of the first principal component. It
was this line which Frisch called "diagonal regression".
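
The behaviour of the estimator between the two regression lines can be checked numerically. The sketch below assumes the estimator takes the form $\theta \pm \sqrt{\theta^2 + 1/\lambda}$ with $\theta = (\sum y'^2 - \sum x'^2/\lambda)/(2\sum x'y')$, which is the solution of the quadratic (4.56); choosing the root by the sign of $\sum x'y'$ is an added convention for negatively related data, and the function name and the simulated data are hypothetical.

```python
import numpy as np


def slope_known_ratio(x, y, lam):
    """Slope estimate when lam = var(d) / var(e) is taken as known."""
    xc, yc = x - x.mean(), y - y.mean()
    sxx, syy, sxy = xc @ xc, yc @ yc, xc @ yc
    theta = (syy - sxx / lam) / (2.0 * sxy)
    root = np.sqrt(theta**2 + 1.0 / lam)
    return theta + np.sign(sxy) * root   # the root with the sign of sxy


rng = np.random.default_rng(2)
U = rng.uniform(-3.0, 3.0, 2_000)
x = U + rng.normal(0.0, 0.7, U.size)
y = 1.0 + 2.0 * U + rng.normal(0.0, 0.7, U.size)

xc, yc = x - x.mean(), y - y.mean()
slope_y_on_x = (xc @ yc) / (xc @ xc)            # limit as lam -> 0
slope_x_on_y_line = (yc @ yc) / (xc @ yc)       # limit as lam -> infinity
evals, evecs = np.linalg.eigh(np.cov(x, y))     # eigenvalues in ascending order
slope_first_pc = evecs[1, -1] / evecs[0, -1]    # direction of largest root
```

Calling `slope_known_ratio(x, y, 1.0)` reproduces `slope_first_pc`, and very small or very large values of `lam` approach the two regression lines.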


4.17 If U and V are measured about their means the estimator
of $\alpha$ is zero and


N E ESANA os EENE (4.60) 
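Thus

\[ E\left(\sum x'^2\right) = \sum U^2 + (n-1)\sigma_1^2 \]
\[ E\left(\sum x'y'\right) = \sum UV \]
\[ E\left(\sum y'^2\right) = \sum V^2 + \frac{n-1}{\lambda}\sigma_1^2. \]

Hence $\theta$ of (4.58) converges in probability to

\[ \frac{\sum V^2 + \frac{n-1}{\lambda}\sigma_1^2 - \frac{1}{\lambda}\left(\sum U^2 + (n-1)\sigma_1^2\right)}{2\sum UV} = \frac{\beta^2 - \frac{1}{\lambda}}{2\beta} \]

and $\hat\beta$ from (4.59) then becomes

\[ \hat\beta = \beta. \qquad (4.61) \]

Thus $\hat\beta$ is a consistent estimator of $\beta$.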


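4.18 But even here there are still peculiarities in the
situation. Consider what happens if we estimate $\sigma_1^2$. From (4.49)
we have

\[ \hat\sigma_1^2 = \frac{1}{2n}\left\{ \sum(x' - U)^2 + \lambda\sum(y' - \alpha - \beta U)^2 \right\}; \]

with $\alpha = 0$ and using (4.50) we have

\[ \hat\sigma_1^2 = \frac{\lambda}{2n}(1 + \lambda\beta^2)\sum(y' - \beta U)^2. \qquad (4.62) \]

Using (4.50) again we have

\[ \hat\sigma_1^2 = \frac{\lambda}{2n}(1 + \lambda\beta^2)\sum(y' - \beta U)^2 \]
\[ = \frac{\lambda}{2n}(1 + \lambda\beta^2)\sum y'(y' - \beta U) \]
\[ = \frac{\lambda}{2n}(1 + \lambda\beta^2)\sum y'\left\{ y' - \frac{\beta(x' + \lambda\beta y')}{1 + \lambda\beta^2} \right\} \]
\[ = \frac{\lambda}{2n}\sum y'(y' - \beta x'). \qquad (4.63) \]

Hence

\[ \frac{2n}{\lambda}\hat\sigma_1^2 = \sum y'^2 - \hat\beta\sum x'y' \]

and

\[ \hat\sigma_1^2 = \frac{\lambda}{2}\left\{ \frac{1}{2n}\sum y'^2 + \frac{1}{2\lambda n}\sum x'^2 - \left[ \left( \frac{1}{2n}\sum y'^2 - \frac{1}{2\lambda n}\sum x'^2 \right)^2 + \frac{1}{\lambda}\left( \frac{1}{n}\sum x'y' \right)^2 \right]^{1/2} \right\} \]
\[ \to \tfrac{1}{2}\sigma_1^2 \qquad (4.64) \]

and thus $\hat\sigma_1^2$ is not a consistent estimator of $\sigma_1^2$. We need to
multiply it by 2 to get such an estimator.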
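
The factor of one half can be seen in a simulation. This is a sketch under stated assumptions: the true values U are generated once and then treated as fixed, the error variances are taken to satisfy var(d) = $\lambda$ var(e) with $\lambda$ known, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
beta, sigma1, lam = 2.0, 0.5, 0.25           # var(d) = sigma1**2 = lam * var(e)
U = rng.uniform(-3.0, 3.0, n)
U -= U.mean()                                # measured about the mean
x = U + rng.normal(0.0, sigma1, n)
y = beta * U + rng.normal(0.0, sigma1 / np.sqrt(lam), n)

xc, yc = x - x.mean(), y - y.mean()
sxx, syy, sxy = xc @ xc, yc @ yc, xc @ yc
theta = (syy - sxx / lam) / (2.0 * sxy)
beta_hat = theta + np.sqrt(theta**2 + 1.0 / lam)      # positive root
sigma1_sq_hat = lam / (2.0 * n) * (syy - beta_hat * sxy)
ratio = sigma1_sq_hat / sigma1**2            # settles near one half
```

The slope estimate is close to the true value while the variance estimate is close to half the true error variance of d, in line with the remark that it must be doubled.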


4.19 A more comprehensive treatment of the classical case 1
has been given by Lindley (1947, Supp. J. Roy. Statist. Soc.
9, 218), who also shows that if $\lambda$ is known the same equations
of estimation arise when u and v are jointly normally
distributed, our so-called Case 2. We shall not pursue the matter
further here except to point out that unbiased estimators of
functional parameters may be obtained in certain cases by
distribution-free methods.


4.20 Suppose that we have a set of observations x', y' and
that they can be divided into two groups such that the values
of v for one group are all greater than the values of v for
the other. (This is not the same as dividing them according
to the values of y', which of course is a trivial matter.) We
can, to put it broadly, then determine a typical point, such
as a centre of gravity, for each group and make our linear
relation pass through the two points. A fortiori, if we can
divide into more than two groups we can fit a relationship of
greater accuracy; the extreme case (Theil, 1950, Indagationes
Mathematicae, 12, No. 2) occurs when the order of the points
with regard to one variable, say y', is the same as the order
with regard to the underlying v, in which case we can determine
a distribution-free regression line.


4.21 The assumption about division into groups can often be
made when the errors of observation are small compared to the
intervals between the variables u and v. Logically, the
subject offers more of a problem. Wald (1940, Ann. Math. Statist.
11, 284), who first discussed the division into two groups, gave
a condition for the validity of the procedure but unfortunately
it is not obeyed by normal variables u and v.
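
The two-group construction of 4.20 amounts to joining the centres of gravity of the groups. A minimal sketch follows; it assumes that the grouping is known from the underlying values and not from the observed ones (here the true values fall into two well separated clusters), and the function name and data are hypothetical.

```python
import numpy as np


def two_group_line(x, y, upper):
    """Line through the centres of gravity of two groups of observations."""
    upper = np.asarray(upper, dtype=bool)
    slope = (y[upper].mean() - y[~upper].mean()) / (
        x[upper].mean() - x[~upper].mean()
    )
    # The overall centre of gravity lies on the line joining the two groups.
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


rng = np.random.default_rng(4)
n_half = 10_000
U = np.concatenate([rng.uniform(0.0, 1.0, n_half), rng.uniform(3.0, 4.0, n_half)])
upper = U > 2.0                              # grouping by the true values
x = U + rng.normal(0.0, 0.5, U.size)
y = 1.0 + 2.0 * U + rng.normal(0.0, 0.5, U.size)

intercept, slope = two_group_line(x, y, upper)
```

Grouping instead by the observed x' or y' would make the group membership depend on the errors, which is the difficulty discussed in 4.21.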


4.22 We shall have to leave the subject at this point, but it
is to be noted that many important questions arising in
connection with the analysis of functional relationships remain
unsolved. For some work in this field reference may be made
to Neyman, J. and Scott, E.L. (1948) Econometrica, 16, 1 and
to Creasy, M.A. (1956), Jour. Roy. Statist. Soc., B, 18, 65.
See also Brown, R.L. (1957), Biometrika, 44, 84.


5. CANONICAL ANALYSIS 


5.1 Let us recapitulate some of the basic results of ordinary
regression theory. We suppose that we have a number of
observations on each of p variables $X_1, \ldots, X_p$ which are not
necessarily random variables (or, in our terminology, variates).
There are certain unknown constants $\beta_0, \beta_1, \ldots, \beta_p$ which it is
our object to estimate. The model we consider is that an
observed variate, y, is given by

\[ y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon \qquad (5.1) \]

where $\epsilon$ is a random variable. Without loss of generality we
may suppose it to have zero mean (for any constant can be
absorbed into $\beta_0$). For expository simplicity we can take a dummy
variable $X_0$ attached to $\beta_0$, having always the value unity and
so write

\[ y = \sum_{j=0}^{p} \beta_j X_j + \epsilon. \qquad (5.2) \]

We also assume as part of the model that $\epsilon$ has the same variance
for any set of X's whatsoever. The X's may be functionally
dependent, so that this model includes the case of a curvilinear
regression line.


5.2 The solution to the problem of estimation is well known.
Without appeal to normality we can derive unbiassed estimators
of minimal variance from the Gauss-Markoff theorem in least
squares; and if in addition we postulate normality the
estimators are maximum likelihood estimators.


5.3 The principal point to notice about this model is that it
is not multivariate at all. It is multivariable in the sense
that there are p variables X, but they are not random variables.
The only variates in the situation are $\epsilon$ and the equivalent y.
Thus multiple regression is not multivariate. We could make it
so by relaxing our restrictions on $\epsilon$ so that, for example,
different X's gave different variances to $\epsilon$, or that successive
values of $\epsilon$ were not independent. But the simple standard
situation is univariate.

5.4 A wide class of problems in statistics can be expressed 
in regression form. Apart from ordinary regression theory 
itself we may, for instance, take dummy variables for our X's, 
allowing them to be unity if the observation falls into a given 
class and zero elsewhere, to get the customary models of vari- 


ance analysis in experimental design. 


5.5 What has caused some confusion in the past is that since
the X's can be chosen how we like, they may in particular be
the values of random variables. If we choose numbers at random
from a (p + 1)-variate population we can consider
relations of type

\[ y = \sum \beta_j x_j + \epsilon \qquad (5.3) \]

which looks like a linear functional relation between y, the
x's and $\epsilon$. It is, in fact, a relation of the functional kind,
but it is still asymmetric in the sense that the x's are
supposed free from error whereas y is subject to the error $\epsilon$.
For this reason we get a different regression line if we pick
out one of the other variates x as the dependent variate
subject to error. The model is distinct from

\[ y - \sum \beta_j x_j = \epsilon \qquad (5.4) \]


where y and the x's are free from error and $\epsilon$ is an
unobservable residual term expressing imperfection in the model; and
from

\[ \epsilon = y - \sum \beta_j x_j \qquad (5.5) \]

where y, x and $\epsilon$ are variates free from error and $\epsilon$ is now
observable. We have made this distinction before but it will
do no harm to make it again.


5.6 When the X's are powers of a single variable X it is
possible to simplify the exposition and analysis of a
regression model by choosing our variables to be orthogonal
polynomials. (The ultimate results, of course, are the same.)
This is particularly useful when the values of X are
equidistant, in which case many of the quantities required can be
(and have been) tabulated once and for all.


It is equally possible to "orthogonalize" a regression
situation for any arbitrary set of variables X. This
possibility does not seem to have been much discussed in the
literature. But it throws some new light on certain old but
unsolved problems; particularly (a) how many variables do we
take? (b) how do we discard the unimportant ones? and (c)
how do we get rid of multicollinearities in them?


5.7 In fact, let us suppose that we do a principal component
analysis on the variables $X_1, \ldots, X_p$ and arrive at new
components $\zeta_1, \zeta_2, \ldots, \zeta_p$. These will give us

\[ y = \sum_j \alpha_j \zeta_j + \epsilon \qquad (5.6) \]

where the $\alpha$'s are linear functions of the $\beta$'s. On our model
the Gauss-Markoff theorem applies to give us estimators of
the $\alpha$'s which are the same linear functions of the estimators
of the $\beta$'s. We therefore lose nothing by the transformation,
except the time spent on the arithmetical labour of finding it.
But our $\zeta$'s are now all orthogonal and we have at once

\[ \hat\alpha_j = \frac{\sum y\zeta_j}{\sum \zeta_j^2}, \qquad (5.7) \]

the summation extending over the sample; and this is easily
converted into sums of type $\sum y x_j$ which are already known.
Further, the reduction in variance due to the fitting of $\zeta_j$
is $\hat\alpha_j^2 \lambda_j$, where $\lambda_j$ is the characteristic root
corresponding to $\zeta_j$. We can thus see how important each contribution
is and decide whether any $\zeta$'s can be discarded as unimportant.
This in turn will allow us to see how far the X's are
important.


We can also apply the "Student" t-test to the
"significance" of the $\alpha$'s; and this will also apply even when the
original x's are variate values chosen by a random process.

Example 5.1 
(Stone, J.R.N., 1945, J. Roy. Statist.Soc., 108, 308,) 


In some studies of demand analysis Stone considered the
consumption of beer in the United Kingdom for the years 1920-
1938 inclusive. He was interested in an equation of type

\[ \log q = a + b\log\rho + c\log p + d\log\pi + f\log g + rt\log e \qquad (5.8) \]

where  q = consumption (bulk barrels)
       $\rho$ = real income
       p = retail price
       $\pi$ = cost-of-living index
       g = gravity of the beer
       t = time
       e = the constant 2.718 ...

The constants a, b, c, d, f, r are under estimate. Calling
the logs of $\rho$, p, $\pi$, g, t respectively $X_1, X_2, \ldots, X_5$ we have
the following correlation matrix (Stone's figures):


at —.610375 —. 660691 —.507697 2918651 
T aara 1250201 SARO 

‘i. .397888  —.831054 

le ~.649439 


1.

A principal component analysis on this matrix gives the
following:

 λ      3.2470   1.2753    .3859    .0700    .0218
 X1     -.5215   -.0711    .4730    .5219   -.4757
 X2      .3121    .7090   -.2161    .5942   -.0112
 X3      .4753    .0381    .8246    .0103    .3050
 X4      .3245   -.6943   -.2224    .5801    .1629
 X5     -.5471    .0935    .0105    .1945    .8087


Here, for example, $\zeta_1 = -.5215 X_1 + .3121 X_2 + .4753 X_3 +
.3245 X_4 - .5471 X_5$. If y = log q we also have for the
correlations with the X's respectively -.457536, .031719,
.899223, .601129, -.710155. We will take the variables as
standardized so that these correlations are also their
covariances.


From the small size of À, and Às we seem safe in neglect- 
ing the contributions from these sources. We then have 


Zy č 


1 
wa a ea 


Ay 3.2470 


(~.5215) (-.4575) + (.3121) (.0317) 


+ (.4753) (.8992) + (.3245)(.6011) — (25471) (--7102) 


= .3879 


Likewise 
M% = —.3093 


Qa = 9772 


On our scale the variance of y is unity and the contribution of
the term in $\zeta_i$ is $\alpha_i^2 \operatorname{var}\zeta_i = \alpha_i^2\lambda_i$. For our first three
factors these contributions are .4885, .1220 and .3685, totalling
.9790; they account for about 98 per cent of the variance.
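
The arithmetic of this step can be checked directly from the tabulated roots and vectors. The following is a minimal sketch, not part of Stone's or the original computation: the array names are hypothetical, and the loadings of $X_3$ on the first two components are taken as .4753 and .0381, an assumed reading of partly illegible print (it is the reading under which the vectors are orthogonal).

```python
import numpy as np

# Characteristic roots and vectors as tabulated (columns = components 1..5).
lam = np.array([3.2470, 1.2753, 0.3859, 0.0700, 0.0218])
L = np.array([
    [-0.5215, -0.0711,  0.4730, 0.5219, -0.4757],   # X1
    [ 0.3121,  0.7090, -0.2161, 0.5942, -0.0112],   # X2
    [ 0.4753,  0.0381,  0.8246, 0.0103,  0.3050],   # X3 (assumed reading)
    [ 0.3245, -0.6943, -0.2224, 0.5801,  0.1629],   # X4
    [-0.5471,  0.0935,  0.0105, 0.1945,  0.8087],   # X5
])

# Correlations of y = log q with X1..X5; the variables are standardized,
# so these are also the covariances.
r_y = np.array([-0.457536, 0.031719, 0.899223, 0.601129, -0.710155])

# cov(y, zeta_k) = sum_j l_jk cov(y, X_j);  alpha_k = cov(y, zeta_k) / lambda_k
alpha = (L.T @ r_y) / lam

# Contribution of each component to var(y) = 1 is alpha_k^2 * lambda_k.
contrib = alpha**2 * lam

print(np.round(alpha[:3], 4))
print(np.round(contrib[:3], 4), round(float(contrib[:3].sum()), 4))
```

Only the first three components are printed, in line with the decision to neglect the two smallest roots.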
We may now go back, if we wish, to express the regression
equation in terms of the standardized variates X. We have

$$\begin{aligned}
\log q \ (\text{about its mean}) &= .3879\,(-.5215X_1 + .3121X_2 + .4753X_3 + .3245X_4 - .5471X_5) \\
 &\quad - .3093\,(-.0711X_1 + .7090X_2 + .0381X_3 - .6943X_4 + .0935X_5) \\
 &\quad + .9772\,(.4730X_1 - .2161X_2 + .8246X_3 - .2224X_4 + .0105X_5) \\
 &= .2819X_1 - .3094X_2 + .9784X_3 + .1233X_4 - .2309X_5
\end{aligned} \tag{5.8}$$


This equation seems to make sense. The consumption is
negatively related to $X_5$, the time variable, and over the
period the consumption of beer appeared to be declining (other
things being equal). The correlation with $X_2$, the retail
price, is also negative as we should expect. On the other
hand, the dependence on $X_1$ (real income) is positive, as also
is that on the gravity ($X_4$). The only surprising feature is
the magnitude of the coefficient of $X_3$, the cost-of-living
index. Seeing that the effect of real income and retail price
is accounted for elsewhere, one must suppose that when all
prices are rising the consumption of beer goes up
proportionately; but this may only be a reflection of the fact
that when prices rise wages rise and more is spent on beer even
if real income is constant.


It is interesting to compare this analysis with one carried
out by Stone (l.c. p. 314 et seq.) by confluence methods.
Stone worked out 960 regression slopes and graphed 240 bunch
maps. He concludes, inter alia,


(1) the two most important determinants are the two
    price factors;

(2) the influence of income is negligible;

(3) there seems to have been little variation attributable
    to factors not introduced explicitly;

(4) the influence of the strength of beer is uncertain
    but positive.


Our analysis would support (1), (3) and (4) but is not
in agreement with (2).


The real lesson to be learnt from this example, however,
is that when there are collinearities in the independent
variables (i.e. when some of the characteristic roots $\lambda$ are
zero or nearly so) no reliance whatever can be put on
individual coefficients in regression equations embodying all
the variables.


In fact if the observed regression is

$$y = \sum b_i X_i \tag{5.9}$$

and there exists a linear relation $\sum l_i X_i = 0$, we can
substitute for any of the X's in the regression and obtain quite
different-looking results. In our present example two $\lambda$'s
are near zero and hence two of the five "independent" variables
are nearly expressible in terms of the other three.


Suppose for example, that we omit $X_1$, $X_2$ and find the
regression of y on $X_3$, $X_4$ and $X_5$. We get

$$y = 1.2506X_3 + 0.5487X_4 + 0.6855X_5 \tag{5.10}$$

If we omit $X_4$ and $X_5$ we get

$$y = -0.02125X_1 - 0.4722X_2 + 1.0966X_3 \tag{5.11}$$

The equations (5.8), (5.10) and (5.11) appear very different;
but as regression equations they are almost equivalent. The
squares of the multiple correlation coefficient for the three
respectively are 0.9896, 0.9676, 0.9808, indicating little
variation in efficiency of prediction. (It is true that the
complements of these quantities, 0.0104, 0.0324, 0.0192, are
substantially different and indicate that the residual errors
are by no means the same; but where all are so small this is
not a very material point.)
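
The same phenomenon is easy to reproduce on artificial figures. The sketch below uses synthetic data (not Stone's), with hypothetical variable names, in which a third regressor is almost an exact linear function of the other two; the two fitted equations look quite different but predict almost equally well.

```python
import numpy as np

rng = np.random.default_rng(0)          # synthetic data, for illustration only
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = x1 + x2 + 0.01 * rng.standard_normal(n)   # nearly collinear with x1, x2
y = x1 + x2 + 0.1 * rng.standard_normal(n)

def fit(columns):
    """Least-squares coefficients and R^2 for y on the given columns,
    all variables being measured about their means."""
    X = np.column_stack(columns)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    r2 = 1.0 - np.sum((yc - Xc @ b) ** 2) / np.sum(yc ** 2)
    return b, r2

b_12, r2_12 = fit([x1, x2])   # omit x3
b_3, r2_3 = fit([x3])         # omit x1 and x2

print(b_12, r2_12)
print(b_3, r2_3)
```

Which variables are dropped is arbitrary here; the point is only that the individual coefficients carry little meaning when a near-linear relation exists among the regressors.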
5.8 We shall not dwell on the use of component analysis in
standard regression theory, although there seems to be a
fruitful field of inquiry here. The main purpose of introducing
the topic here was to provide an introduction to a more
general technique known as canonical analysis.


We consider a case where, instead of one variate y dependent
on a set of x's, we have a set of y's and a set of
x's which are mutually dependent. We are not particularly
interested in the relations of the x's among themselves or of
the y's among themselves, but in the relationship between
the two groups.


5.9 There arises here the same sort of distinction that we 
drew between correlation and regression, or between inter— 
dependence and dependence. From one point of view the re- 
lation of y's and x's may be considered as a symmetrical one 
of correlation; from the other, one set, say the y's, are 
regarded as dependent on the other and the x's may, as in the 
univariate case, be "fixed" variables. From most viewpoints 
it is simpler to follow the latter approach, just as it is 
simpler to deal with a regression model than with a correlation
model. But both are in use.


5.10 Following the line of thought given earlier in this chap— 
ter we might perform a component analysis on both the y's and 
the x's and then investigate the relationship of the trans— 
formed variables. There is something to be said for this 
procedure but in canonical analysis we take a somewhat dif- 
ferent line: we still transform x's and y's to new variables 
which are orthogonal (independent) but not so as to maximize
the variance in any particular direction. Instead we maximize
the covariances (or rather correlations) between certain
members of the two sets while reducing the others to zero. Thus
the relationship between the two groups is reduced to its
simplest form.


Changing the notation slightly, let one set consist of
p variates $X_1, \ldots, X_p$ and the other of q other variates
$X_{p+1}, \ldots, X_{p+q}$. We assume $p \le q$. (In the contrary case we
invert the roles of the two sets.)
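Using Greek dummy suffixes for variates in the p-group
and Roman suffixes for those in the q-group, take new variates
defined by

$$\xi_\alpha = \sum_\beta c_{\alpha\beta} x_\beta \qquad (\alpha, \beta = 1, \ldots, p) \tag{5.12}$$

$$\eta_a = \sum_b d_{ab} x_b \qquad (a, b = p+1, \ldots, p+q) \tag{5.13}$$

If they have unit variance we have

$$\sum_{\beta,\gamma} c_{\alpha\beta} c_{\alpha\gamma} v_{\beta\gamma} = 1 \tag{5.14}$$

$$\sum_{b,c} d_{ab} d_{ac} v_{bc} = 1 \tag{5.15}$$

where the v's are covariances. We now lay down the condition
that the covariance of a pair from the two groups is stationary,
e.g.

$$\sum_{\beta,b} c_{\alpha\beta} d_{ab} v_{\beta b} = \text{stationary} = R, \text{ say} \tag{5.16}$$

If $\lambda$ and $\mu$ are undetermined multipliers, the stationary values
of (5.16) subject to (5.14) and (5.15) are given by

$$\sum_b d_{ab} v_{\beta b} - \lambda \sum_\gamma c_{\alpha\gamma} v_{\beta\gamma} = 0 \tag{5.17}$$

$$\sum_\beta c_{\alpha\beta} v_{\beta b} - \mu \sum_c d_{ac} v_{bc} = 0 \tag{5.18}$$

Multiplying these two respectively by $c_{\alpha\beta}$, $d_{ab}$ and summing for
$\beta$ or b we find

$$\lambda = \mu = R$$

(5.17) and (5.18) are then soluble if their determinant
vanishes, which leads to

$$\begin{vmatrix} -\lambda v_{\alpha\beta} & v_{\alpha j} \\ v_{i\beta} & -\lambda v_{ij} \end{vmatrix} = 0 \tag{5.19}$$

which is equivalent to

$$\begin{vmatrix} \lambda^2 v_{\alpha\beta} & v_{\alpha j} \\ v_{i\beta} & v_{ij} \end{vmatrix} = 0 \tag{5.20}$$

Premultiplying this by

$$\begin{pmatrix} \delta_{\alpha\gamma} & -\sum_k v_{\alpha k} v^{kj} \\ 0 & v^{ij} \end{pmatrix}$$

where $v^{ij}$ is inverse to $v_{ij}$, we find

$$\begin{vmatrix} \lambda^2 v_{\alpha\beta} - \sum_{j,k} v_{\alpha j} v^{jk} v_{k\beta} & 0 \\ \sum_j v^{ij} v_{j\beta} & \delta_{ij} \end{vmatrix}
 = \Bigl| \lambda^2 v_{\alpha\beta} - \sum_{j,k} v_{\alpha j} v^{jk} v_{k\beta} \Bigr| = 0 \tag{5.21}$$

a determinant of p rows and columns.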


We shall suppose that the roots in $\lambda^2$ are distinct. (The
contrary case is tractable but of secondary interest.) We
choose the p positive roots in $\lambda$. These are correlations
from (5.21) and the corresponding transformations to $\xi$'s and
$\eta$'s exist. We may also show that the variables have the
following properties:


(1) All $\xi$'s and $\eta$'s have zero mean and unit variance;

(2) Any $\xi$ is independent of any other $\xi$ and any $\eta$ is
    independent of any other $\eta$;

(3) The correlation between any $\xi$ and any $\eta$ is zero except
    for p correlations $\rho_1, \ldots, \rho_p$ which may be taken as
    the correlations between $\xi_1$ and $\eta_1$, $\xi_2$ and $\eta_2$, etc.


The dispersion matrix then becomes

$$\begin{pmatrix} I_p & (P \;\; 0) \\ (P \;\; 0)' & I_q \end{pmatrix}, \qquad P = \operatorname{diag}(\rho_1, \ldots, \rho_p) \tag{5.22}$$

The determinant is

$$(1 - \rho_1^2)(1 - \rho_2^2) \cdots (1 - \rho_p^2) \tag{5.23}$$

5.11 We notice the close resemblance between (5.21) and the
characteristic determinant of component analysis. A practical
solution may proceed by iteration in the manner described
in Chapter 1. A numerical example of Hotelling's will make
the arithmetic clear.


Example 5.2 


140 seventh-grade schoolchildren received four tests
on (a) reading speed, (b) reading power, (c) arithmetic speed
and (d) arithmetic power. The correlations in performance
were as follows. (Figures to four places for greater
accuracy.)

$$\begin{pmatrix} 1. & .6328 & .2412 & .0586 \\ & 1. & -.0553 & .0655 \\ & & 1. & .4248 \\ & & & 1. \end{pmatrix} \tag{5.24}$$

The determinant (5.19) is written down as

$$\begin{vmatrix} -\lambda & -.6328\lambda & .2412 & .0586 \\ -.6328\lambda & -\lambda & -.0553 & .0655 \\ .2412 & -.0553 & -\lambda & -.4248\lambda \\ .0586 & .0655 & -.4248\lambda & -\lambda \end{vmatrix} = 0 \tag{5.25}$$


For determinants of larger order a systematic method of
solution is desirable. Writing (5.25) in the form of (5.20)

$$\begin{vmatrix} \lambda^2 & .6328\lambda^2 & .2412 & .0586 \\ .6328\lambda^2 & \lambda^2 & -.0553 & .0655 \\ .2412 & -.0553 & 1.0000 & .4248 \\ .0586 & .0655 & .4248 & 1.0000 \end{vmatrix} = 0$$

"Sweep out" the $\lambda^2$ from the first row, second column and then
from the second row, first column, to get

$$\begin{vmatrix} \lambda^2 & 0 & .241200 & .058600 \\ 0 & .599564\lambda^2 & -.207931 & .028418 \\ .241200 & -.207931 & 1.000000 & .424800 \\ .058600 & .028418 & .424800 & 1.000000 \end{vmatrix} = 0$$

Then sweep out the first three items in the bottom row to get

$$\begin{vmatrix} \lambda^2 - .003434 & -.001665 & .216307 \\ -.001665 & .599564\lambda^2 - .000808 & -.220003 \\ .216307 & -.220003 & .819545 \end{vmatrix} = 0$$

Then sweep out the first two items in the bottom row to get

$$\begin{vmatrix} \lambda^2 - .060525 & .056401 \\ .056401 & .599564\lambda^2 - .059867 \end{vmatrix} = 0 \tag{5.26}$$

This is a general procedure and we end up with a determinant
of type (5.26) from which the $\lambda$ can be obtained by the
iterative method we used for principal components in Chapter
1. In our present case it is simpler to evaluate (5.26) as
a quadratic in $\lambda^2$,

$$.599564\lambda^4 - .096156\lambda^2 + .000442377 = 0$$

giving $\lambda$ = .3945 or .0689.


An examination of these values suggests that the correlation
represented by 0.0689 is negligible and the relationship
between the reading tests and arithmetic tests is concentrated
in the first correlation. If necessary we can find
the canonical variates to which this correlation corresponds.
From (5.17) we have, writing $c_j$ (j = 1, 2) for $c_{1j}$ and $d_j$ for $d_{1j}$,

$$\begin{aligned}
c_1 + .6328c_2 - .6114d_1 - .1485d_2 &= 0 \\
.6328c_1 + c_2 + .1402d_1 - .1660d_2 &= 0 \\
-.6114c_1 + .1402c_2 + d_1 + .4248d_2 &= 0 \\
-.1485c_1 - .1660c_2 + .4248d_1 + d_2 &= 0
\end{aligned}$$

Only three of these are independent and we solve for the ratios

$$c_1 : c_2 : d_1 : d_2 = -2.7722 : 2.2655 : -2.4404 : 1$$

Thus our new variate $\xi_1$ is given by a multiple of $-2.7722x_1
+ 2.2655x_2$ and $\eta_1$ by a multiple of $-2.4404x_3 + x_4$. These
are independent of $\xi_2$ and $\eta_2$, and the correlation between $\xi_1$
and $\eta_1$ is 0.3945. That between $\xi_1$ and $\eta_2$, $\xi_2$ and $\eta_1$ is
exactly zero and between $\xi_2$ and $\eta_2$ nearly so. The relationship
is thus nearly summarised in the single canonical correlation
0.3945.
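
On a modern machine the sweeping-out arithmetic can be replaced by a matrix routine. The sketch below is one way of doing so and is not the method of the text: it whitens each set by a Cholesky factor and takes a singular value decomposition, which is an assumed (though standard) alternative route to the roots of (5.21). The function name and argument names are hypothetical.

```python
import numpy as np

def canonical_correlations(R, p):
    """Canonical correlations, and weight vectors, between the first p
    variates and the remaining ones, from a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    V11, V12, V22 = R[:p, :p], R[:p, p:], R[p:, p:]
    L1 = np.linalg.cholesky(V11)               # V11 = L1 L1'
    L2 = np.linalg.cholesky(V22)               # V22 = L2 L2'
    # K = L1^{-1} V12 L2^{-T}; its singular values are the canonical correlations
    K = np.linalg.solve(L1, V12) @ np.linalg.inv(L2).T
    U, s, Vt = np.linalg.svd(K)
    C = np.linalg.solve(L1.T, U)               # columns: weights for the first set
    D = np.linalg.solve(L2.T, Vt.T)            # columns: weights for the second set
    return s, C, D

# Hotelling's correlations (5.24): reading speed, reading power,
# arithmetic speed, arithmetic power.
R = np.array([
    [1.0,     0.6328,  0.2412, 0.0586],
    [0.6328,  1.0,    -0.0553, 0.0655],
    [0.2412, -0.0553,  1.0,    0.4248],
    [0.0586,  0.0655,  0.4248, 1.0],
])
rho, C, D = canonical_correlations(R, p=2)

print(np.round(rho, 4))                        # the two canonical correlations
print(C[0, 0] / C[1, 0], D[0, 0] / D[1, 0])    # ratios c1:c2 and d1:d2
```

The weight vectors are determined only up to a constant multiplier, so the comparison with the text is made through the ratios $c_1 : c_2$ and $d_1 : d_2$.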


5.12 Considered as a technique (like component analysis) of a 
purely mathematical kind for the convenient representation of 
data, canonical analysis has obvious advantages. Instead of 
operating on correlated data and disentangling the effects 
afterwards we attempt a disentanglement before the analysis 
begins by transforming to new variables with as much indepen— 
dence as possible. As in the component-analysis case, (and, 
indeed, as in ordinary scalar regression theory) we may then 
have to face the question whether our new variables have any 
obvious interpretation and can be identified with something 
"real", or whether they are to remain mere artefacts brought 
out by the mathematics. It would have been instructive at
this point to give a number of practical examples of canonical
analysis like those on factor and component analysis in
Chapter 2. But, unfortunately, the technique has not yet been
applied at all widely and there is a shortage of good
illustrations. Theory, though far from complete, has outrun
practice.


Example 5.3 
(F. V. Waugh, 1942, Econometrica, 10, 290.) 


Waugh took, for each of the years 1921 to 1940 inclusive, 
the prices of beef steers and hogs and the per capita consump— 
tion of beef and pork (excluding lard) for the U.S.A. The 
prices were "deflated" by dividing by an index of per capita 
income, that is to say they purport to measure the changes in price 
relative to a stable value of money, and are given as dollars 
per 100 lbs. at Chicago. The consumption is given in pounds
per annum.


We thus have, for 20 years (n = 20) a multivariate situation
with p = 2, q = 2. We require to discuss the question
how far meat consumption and meat prices are related, "meat"
for this purpose including beef and pork but not mutton or
chicken or minor sources of meat.


This, in one way, is a simplified form of the problem of
canonical analysis because effectively we need only one linear
combination of the price-variables and consumption-variables and
the greatest canonical correlation. The others are of minor
interest. The correlation matrix was


                        X_1       X_2       X_3       X_4

X_1 (steer prices)      1.      .18126   -.56396   -.49898
X_2 (hog prices)                  1.      .35494   -.75671
X_3 (beef cons.)                            1.     -.10293
X_4 (pork cons.)                                      1.


Let us note that these correlations make economic sense. 
The correlations between steer prices and beef consumption 
and between hog prices and pork consumption are negative; a 
rise in price means a fall in consumption. But the correlation
between hog price and beef consumption is positive;
when pork goes up in price there is a switch to beef, the
consumption of which also goes up. (But the correlation between
steer prices and pork consumption is negative so that
substitutional effects are not entirely straightforward.)


The classical way of discussing the question would pro- 
bably have been to form an index-number (a weighted average) 
of prices and another (also a weighted average) of consump— 
tion and to investigate the relation between the two. The 
weights used in constructing these indices would have to be 
selected on prior grounds of a somewhat arbitrary kind: and 
the resulting correlation would, of course, depend on them. 

If we adopt the standpoint of canonical analysis the weights 
are determined for us by the condition that the correlation is 
a maximum. But we must always remember that there is nothing 
in the economics of the situation to compel the supposition 
that correlations are maximized. When we come to the stage 
of interpretation we shall encounter the same difficulty that 
we have already met in component analysis: of knowing whether 
our linear functions correspond to anything "real" or whether 
they are merely matters of mathematical convenience.
The greatest canonical correlation in this present example
was -0.84666 (greatest, that is, in absolute value).
The new variables, in terms of the standardized X's, were

$$\xi_1 = \text{constant}\,(52.62X_1 + 47.36X_2)$$

$$\eta_1 = \text{constant}\,(25.38X_3 + 74.62X_4) \tag{5.27}$$

where we have chosen the weights so as to add up to 100.


On looking at these values we see that the signs at least
are acceptable. $\xi_1$ is an average of prices with nearly equal
weights, a reasonable average for prices which both relate to
the same quantity of meat; and the weights in $\eta_1$ are also
both positive. Whether they are "reasonable" in the sense
of providing a good index-number of the consumption of meat
is another question, and one which has no answer unless we
can say for what purposes the index is intended.


We may contrast this situation, which appears to give
acceptable results, with the analysis of Example 5.2 where
the canonical correlation was lower, 0.3945, and the canonical
variables were

$$\xi_1 = \text{constant}\,(-2.7722X_1 + 2.2655X_2)$$

$$\eta_1 = \text{constant}\,(-2.4404X_3 + X_4).$$

$X_1$ and $X_2$ were tests of reading (speed and power) and $X_3$ and
$X_4$ were tests of arithmetic (speed and power). Since the
coefficients in both the canonical variables are of opposite
sign we cannot regard, say, $\xi_1$ as expressing some general
capacity in reading formed by superposing speed and power.
Had they been of the same sign we might well have suspected
that the two were being combined into an index of ability to
read. As it is, we seem led to the conclusion that speed
and power are very different things and cannot be amalgamated
in such a simple way as is embodied in the use of linear
functions.
Example 5.4
(F. V. Waugh, loc. cit.) 


The previous example should be regarded as illustrative 
only. The observations are correlated in time and it might 
be possible to obtain "nonsense" canonical correlations just 
as it is possible to obtain nonsense correlations in ordinary 
time-series analysis. The following example is free from 
this difficulty. 


Information was available about 138 samples of Canadian 
Hard Red Spring wheat and the flour made from them. For the 
wheat five measurements were obtained: 


X_1   kernel texture
X_2   test weight
X_3   damaged kernels
X_4   foreign material
X_5   crude protein in the wheat


For the flour there were four measurements:


X_6   wheat per barrel of flour
X_7   ash in flour
X_8   crude protein in flour
X_9   gluten quality index


Here n = 138, p = 4, q = 5. The correlation matrix was


       1        2        3        4        5        6        7        8        9

1     1.     .75409  -.69048  -.44578   .69173  -.60463  -.47881   .77978  -.15205
2              1.    -.71285  -.51483   .41184  -.72236  -.41878   .54245  -.10236
3                       1.     .52526  -.44393   .73742   .36152  -.54624   .17224
4                                1.    -.33439   .52744   .46092  -.39286  -.01873
5                                         1.    -.58310  -.50494   .73666  -.14848
6                                                  1.     .25056  -.48993   .24955
7                                                           1.    -.43381  -.07851
8                                                                    1.    -.16276
9                                                                             1.
The first canonical correlation is +0.909,388, an unusually
high value. The first two canonical variates are

$$\xi_1 = \text{constant}\,(.35771X_1 + .20508X_2 - .56005X_3 - \text{[illegible]}\,X_4 + .50449X_5)$$

$$\eta_1 = \text{constant}\,(-X_6 - 0.53727X_7 + .84773X_8 + .04578X_9) \tag{5.28}$$

where, as usual, the X's are expressed in standard measure.
These are the (linear) index-numbers which give us the maximum
correlation between the properties of the wheat and those of the
flour.


Here again, having carried out the analysis, we need to 
look very carefully at the results to see if they make sense. 
On the whole it appears that they do. For example, in $\xi_1$ the
signs given to kernel texture, test weight and crude protein
are positive, these being the variables for which high scores
indicate greater value; and those for the detrimental
qualities of damaged kernels and foreign material are negative.
In $\eta_1$ the wheat per barrel and ash content are negative and the
crude protein and gluten quality positive. For these signs
the canonical correlation is positive.


We could, of course, get equivalent results by changing
all the signs of one of the canonical variables and taking the
canonical correlation as -.909388. Which method of presentation
we choose depends on the individual circumstances. The
"constants" in (5.28) are positive.


It is, unfortunately, necessary to add that Waugh, to whom 
the analysis is due, carried out similar analyses on U.S. Hard 
Red Winter wheat, and found that although the canonical corre— 
lations were higher than for Canadian, the signs of the coeffi- 
cients in the canonical variables were no longer satisfactory 
in all cases. Waugh suggests that the variables were dominated
by $X_5$ and $X_8$, representing gluten content, which was inversely
related to some of the other desirable characteristics.


It appears to me that in work of this character there may
arise instabilities in the coefficients like those in Example
5.1. To decide the question we ought to work out the other
canonical roots and see whether multicollinear effects are
present. But the arithmetic would be formidable, although
not beyond our capacity.


6. SOME PROBLEMS OF SAMPLING 


6.1 In multivariate analysis, especially in those branches 
which deal with components, factors and canonical correla- 
tions, exact inferences from sample to parent offer some ex- 
ceedingly difficult problems in distribution theory. In this 
chapter we shall attempt a sketch of what is known and what
is unknown about these problems, but it will necessarily be
a somewhat cursory survey. For a more detailed treatment
of the mathematics reference may be made to Kendall's Advanced
Theory of Statistics, volume 2, Wilks' Mathematical Statistics
and Rao's Advanced Statistical Methods in Biometric Research.
The difficulties are not, however, solely of a
mathematical kind; some of them concern the proper model 
which should be set up in particular cases. 


6.2 We begin with an account of the sampling distributions 
required. Practically everything that is known about exact 
distributions in the multivariate case depends on the assump— 
tion that the parent distribution is multivariate normal. 
This assumption will be made throughout. 


It will be recalled that in samples from a univariate 
normal population the mean and variance are independently 
distributed. Also, in samples from a bivariate population
the means are distributed independently of the variances and
covariance. This type of result holds generally. In samples
from a p-variate normal population the set of p means are
distributed jointly normally, the correlation between $\bar{x}_i$ and
$\bar{x}_j$ being $\rho_{ij}$, the correlation between $x_i$ and $x_j$. The
covariances and variances have a much more complicated form
known as Wishart's distribution. Suppose we have a
multivariate distribution with dispersion (covariance) matrix

$$(A_{ij})^{-1} \tag{6.1}$$
where we take the variates to have zero means and unit
variances. (A simple transformation gives us the more general
case if we need it.) The distribution is then

$$dF = \frac{|A|^{\frac12}}{(2\pi)^{\frac12 p}} \exp\Bigl\{-\tfrac12 \sum A_{ij} x_i x_j\Bigr\} \prod dx \tag{6.2}$$

Let the sample covariance be

$$a_{ij} = \frac{1}{n} \sum (x_i - \bar{x}_i)(x_j - \bar{x}_j), \tag{6.3}$$

where the summation is over the sample values. The joint
distribution of the a's is then

$$dF = \frac{(\tfrac12 n)^{\frac12 p(n-1)} \, |A|^{\frac12(n-1)} \, |a|^{\frac12(n-p-2)} \exp\bigl\{-\tfrac12 n \sum A_{ij} a_{ij}\bigr\} \prod da}
{\pi^{\frac14 p(p-1)} \prod_{k=1}^{p} \Gamma\{\tfrac12(n-k)\}} \tag{6.4}$$

Two points are to be noticed about this. First, the product
$\prod da$ takes place over all $a_{ij}$, but not over both, for example,
$a_{12}$ and $a_{21}$. There are $\tfrac12 p(p+1)$ terms and more
explicitly we should write the differential element

$$\prod_{i \le j} da_{ij}. \tag{6.5}$$

On the other hand the summation in the exponential takes place
over all i, j so that any term, say $A_{12}a_{12}$, occurs twice.


The second point is that n is the number in the sample
and will always be used as such. Some writers take it as
the "number of degrees of freedom" on the ground that the mean
has been factorized off from the frequency function. This
would involve putting n instead of n - 1 in (6.4) and taking
the sample number as n + 1.
6.3 Theoretically we could get the distribution of sample
correlation coefficients from (6.4) by integrating out the
variances. This, however, leads to some intractable integrals
(cf. the case when p = 2). A good many distributional
problems based on Wishart's distribution have to be tackled by
roundabout methods, e.g. by finding lower moments and
approximating, or by reducing to the case where variates are
independent.


In the case of independence we can obtain the distribution
of sample correlations explicitly. The $A_{ij}$ are then all
zero for $i \ne j$ and the covariances do not appear in the
exponential of (6.4). A simple variate transformation then
gives us for the distribution of the r's

$$dF = \frac{[\Gamma\{\tfrac12(n-1)\}]^{p} \, |R|^{\frac12(n-p-2)}}
{\pi^{\frac14 p(p-1)} \prod_{k=1}^{p} \Gamma\{\tfrac12(n-k)\}} \prod_{i<j} dr_{ij} \tag{6.6}$$

where |R| is the sample correlation determinant $|r_{ij}|$. If
p = 2 this reduces to the familiar form for the bivariate
case.


6.4 The actual integration of (6.6) with respect to individual
r's is complicated, not only because of the nature of R but
because of the ranges of the $r_{ij}$. However, we can obtain
from it the moments of the determinant |R| by an argument due
to Wilks. In fact

$$E(|R|^t) = \frac{[\Gamma\{\tfrac12(n-1)\}]^{p}}{[\Gamma\{\tfrac12(n-1)+t\}]^{p}}
 \prod_{k=1}^{p} \frac{\Gamma\{\tfrac12(n-k)+t\}}{\Gamma\{\tfrac12(n-k)\}} \tag{6.7}$$


Bartlett, identifying the lower moments with those of a
$\chi^2$ distribution, obtained an approximation:

$$\chi^2 = -\{(n-1) - \tfrac16(2p+5)\} \log |R| \tag{6.8}$$

is distributed as $\chi^2$ with $\tfrac12 p(p-1)$ degrees of freedom.
6.5 An alternative and somewhat heuristic method of deriving
the result, also due to Bartlett (1948, Brit. J. Psych. Stat.
Section, 1, 73), is given by the consideration that

$$(1 - r_{12}^2)(1 - R_{3.12}^2)(1 - R_{4.123}^2) \cdots (1 - R_{p.12\ldots(p-1)}^2) = |R| \tag{6.9}$$

where the R's are multiple correlation coefficients. The
various factors on the left here are independent in the case
of parent independence. The logarithms of these factors
multiplied by a factor in n are approximated by $\chi^2$
distributions:

$-\{(n-1) - \tfrac12(3)\} \log(1 - r_{12}^2)$ with 1 d.f.

$-\{(n-1) - \tfrac12(4)\} \log(1 - R_{3.12}^2)$ with 2 d.f.

...

$-\{(n-1) - \tfrac12(p+1)\} \log(1 - R_{p.12\ldots(p-1)}^2)$ with p - 1 d.f.

Replacing the various factors in braces on the left by their
weighted average $(n-1) - \tfrac16(2p+5)$ we arrive at a $\chi^2$ given by
(6.8) with $\tfrac12 p(p-1)$ d.f.
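
The statistic (6.8) is simple to compute once the sample correlation matrix is in hand. A minimal sketch follows; the function name is hypothetical, n is taken to be the number in the sample as in the text, and the small 2 x 2 matrix is an arbitrary illustration rather than data from any of the examples.

```python
import numpy as np

def bartlett_chi2(R, n):
    """Bartlett's approximate chi-square (6.8) for the hypothesis that all
    parent correlations are zero. R: sample correlation matrix; n: sample size."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    sign, logdet = np.linalg.slogdet(R)      # logdet = log |R|, computed stably
    chi2 = -((n - 1) - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return chi2, df

# Arbitrary illustration: p = 2, r = 0.5, n = 20.
chi2_stat, df = bartlett_chi2([[1.0, 0.5], [0.5, 1.0]], n=20)
print(chi2_stat, df)
```

The value would then be referred to a table of $\chi^2$ with the stated degrees of freedom.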


6.6 It is generally true in this field that the distributions 


of determinants and determinantal ratios are easier to explore 
than those of their constituent items. They do not always 
give us the tests we should like, but in the present state of 
knowledge we have to be thankful for any test of even tangential
relevance to the point under inquiry.


6.7 A distribution due to Hotelling bears the same relation
to Wishart's as Student's t to the normal in the univariate
case. Let $b_{ij} = n a_{ij}$ and let $c_{ij}$ be the matrix inverse to
$b_{ij}$. Then

$$T^2 = n(n-1) \sum c_{ij} \bar{x}_i \bar{x}_j \tag{6.10}$$

is distributed in a form given by

$$dF = \frac{1}{B\{\tfrac12(n-p), \tfrac12 p\}} \,
 \frac{\{T^2/(n-1)\}^{\frac12(p-2)}}{\{1 + T^2/(n-1)\}^{\frac12 n}} \, d\{T^2/(n-1)\} \tag{6.11}$$
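
A short numerical sketch of (6.10) may help to fix the definitions. It uses synthetic data and hypothetical names; the conversion to a variance ratio with p and n - p degrees of freedom is the standard result corresponding to (6.11), stated here as an assumption rather than taken from the text.

```python
import numpy as np

def hotelling_T2(X):
    """T^2 of (6.10) for the hypothesis of zero parent means.
    Rows of X are the n sample members, columns the p variates."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    xbar = X.mean(axis=0)
    a = (X - xbar).T @ (X - xbar) / n        # sample covariances, divisor n as in (6.3)
    b = n * a                                # b_ij = n a_ij
    T2 = n * (n - 1) * xbar @ np.linalg.solve(b, xbar)   # c = b^{-1}
    F = (T2 / (n - 1)) * (n - p) / p         # referred to F with (p, n - p) d.f.
    return T2, F

rng = np.random.default_rng(1)               # synthetic data, for illustration only
X = rng.standard_normal((30, 3))
T2, F = hotelling_T2(X)

# The same quantity in the more familiar form n * xbar' S^{-1} xbar,
# with S the covariance matrix having divisor n - 1.
S = np.cov(X, rowvar=False)
T2_alt = 30 * X.mean(axis=0) @ np.linalg.solve(S, X.mean(axis=0))
print(T2, T2_alt, F)
```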
6.8 One other set of distributional results is continually
recurring in multivariate analysis. They concern the
distribution of the characteristic roots of determinantal
expressions in which the elements are covariances or analogous
quantities. A great many branches of multivariate theory
can be regarded as special cases of canonical regression or
correlation and the basic results nearly all rest on a
procedure which is common to them all. If L represents a
vector of p quantities and A, B are p x p matrices, we require
to maximize for variations in L the quadratic form L'AL under
the restriction that the other quadratic form L'BL is
constant. This, in fact, is effectively what we did in
component analysis, with A as the correlation matrix and B as
the unit matrix. Such problems lead to the solution of
determinantal expressions of type

$$|A - \lambda B| = 0 \tag{6.12}$$

In cases where one or both of A and B are subject to sampling
fluctuations there will be generated a distribution of the
values of $\lambda$ which are the roots of (6.12). Our interest
centres in this distribution, and particularly in the
distribution of largest, second largest ... roots.
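
For given numerical A and B the roots of an equation of type (6.12) can be found by reducing it to an ordinary symmetric characteristic-root problem. The sketch below assumes A symmetric and B positive definite; the function name and the two small matrices are hypothetical illustrations.

```python
import numpy as np

def roots_of_A_minus_lambda_B(A, B):
    """Roots of |A - lambda B| = 0 (largest first) for symmetric A and
    positive-definite B, with vectors L scaled so that L'BL = 1."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    G = np.linalg.cholesky(B)                    # B = G G'
    Ginv = np.linalg.inv(G)
    lam, W = np.linalg.eigh(Ginv @ A @ Ginv.T)   # ordinary symmetric problem
    order = np.argsort(lam)[::-1]
    return lam[order], (Ginv.T @ W)[:, order]

A = np.array([[1.0, 0.6], [0.6, 1.0]])           # arbitrary illustration
B = np.array([[2.0, 0.3], [0.3, 1.0]])
lam, Lvec = roots_of_A_minus_lambda_B(A, B)
print(lam)
```

The largest root is the maximum of L'AL subject to L'BL = 1, and the corresponding column of the returned matrix attains it.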


6.9 Such distributions are exceedingly difficult to obtain. 
It is possible to find, in the normal case, the distribution 
of all roots together and of certain symmetric functions of 
them (in particular, their sum); but little exact knowledge
is available about the individual roots except in the asymptotic
case of large samples. This is a pity, because a thorough
knowledge of the distributional situation would have very many
applications. Apart from purely mathematical complexities 
the difficulties are three-fold. 


(a) First, the number of parent parameters of a normal 
population rapidly increases with p, there being $\tfrac12 p(p+1)$
dispersion parameters apart from the p means. The number of 
possible nuisance parameters can therefore be extensive, and 
their removal by Studentization accordingly much more diffi- 
cult to carry out. 


(b) Secondly, the matrix B in (6.12) is sometimes 
unknown and unobservable (or at least unobserved), as for
example in the factor analysis case or the estimation of
parameters in functional relations with several variables subject
to error. Strictly speaking this is not a distributional 
problem; it arises from the model and the experimental situa— 
tion; but it is none the less a severe handicap. 


(c) Thirdly, in the particular case of characteristic
roots we are faced with a problem which scarcely arises in
ordinary distributional theory, namely the problem of
identifying the variate we are discussing. If we imagine a set of
values for A and B inserted in (6.12) and then allow A and B
to vary slowly, we see that at certain points the roots in $\lambda$
"change places", the one which was the second highest becoming, 
perhaps, the first highest. If the roots are all distinct 
and the sample is so large that the variation is small compared 
to the distance between them, they will retain their order and 
can be identified. This is one fundamental reason why we can 
make more progress in the asymptotic case, when the order of 
the roots is undisturbed by sampling effects although their 
values may alter. 


6.11 An exact expression for the joint distribution of the roots 
is obtainable when the parent variates are all independent. 

The distribution can be put in various forms and we will give 
one of them purely as an illustration of the kind of form which 
arises.


In the distribution (6.4) suppose that all the parent
covariances are zero and the parent variances unity and consider
the roots of

$$|a - \lambda I| = 0 \tag{6.13}$$

The determinant |a| is the product of the roots, say $\lambda_1 \ldots \lambda_p$,
and the term in the exponential reduces to $-\tfrac12 n$ times the trace
of a, which is the sum of the $\lambda$'s. Consequently the frequency
element of the joint distribution of $\lambda$'s is

$$C \Bigl(\prod_{i=1}^{p} \lambda_i\Bigr)^{\frac12(n-p-2)} \exp\Bigl(-\tfrac12 n \sum \lambda_i\Bigr) \prod da \tag{6.14}$$

where C is a constant. The only difficulty arises from the
evaluation of the differential element. This may be shown
(e.g. Mood, 1951) to be

$$\prod_{i<j} (\lambda_i - \lambda_j) \prod d\lambda_i$$

The evaluation of the constant can be carried out rather
tediously. For our present purposes it is enough to note
the form of the frequency element (the Fisher-Hsu-Roy
distribution):

$$C \prod_{i=1}^{p} \lambda_i^{\frac12(n-p-2)} \exp\Bigl\{-\tfrac12 n \sum_{i=1}^{p} \lambda_i\Bigr\} \prod_{i<j} (\lambda_i - \lambda_j) \prod d\lambda_i,
 \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \tag{6.15}$$
6.12 Roy (1945 and many subsequent papers) has studied this 
distribution, the integral of which is a kind of generalized 
Beta-distribution. Rees and Foster (1957) have 
begun the tabulation of the percentage points of the largest 
root, the latter's work contemplating tables for p = 2, 3, 
4 and 5. 


An approximation due to P.L. Hsu (1941) states that the
sum of the r smallest roots can be tested in the $\chi^2$
distribution. More specifically if

$$\Lambda_r = n(\lambda_{p-r+1} + \lambda_{p-r+2} + \cdots + \lambda_p) \tag{6.16}$$

then $\Lambda_r$ is distributed as $\chi^2$ with r(n-p+r) degrees of
freedom. The $\lambda$'s here are the roots of (6.13), i.e. of the
dispersion matrix, not of the correlation matrix. Something of
this kind is not unexpected for large samples. Each $\lambda$, in
fact, is the variance of a principal component (when we
standardize the covariance matrix to turn it into a correlation
matrix) and the sum of r such, each with n d.f., would have
rn d.f. This remark is to be regarded as an observation on
the consistency of the result, not as an unrigorous proof.


T.W. Anderson (1948, J. Roy. Statist. Soc. B, 10, 132)
takes this a little further by showing that $(\Lambda_r - rn)/\sqrt{n}$
is asymptotically normal with mean zero and variance 2r.
6.13 The quotation of results like this out of their context
is, however, a little dangerous. The distributions and the 
tests based on them depend on the model under consideration 
and circumstances will alter the cases. This is only another 
way of saying, of course, that the test depends on the hypo— 
thesis; but the warning seems especially necessary in multi— 
variate analysis. 


Significance and estimation in component and factor analysis 


6.14 In the component-analysis approach we do not postulate
the existence of any "structure". We suppose that our observations
x_ij are chosen at random from a p-variate population;
that they are observed without error; and we narrow the
sampling discussion by imposing the condition that this population
is normal. Corresponding to any component analysis
on a sample we may imagine an analysis carried out on the
parent. Our sample values of λ are then estimators of parent
values and our sample l's are estimators of parent values. It
is meaningful to inquire how close the sample values lie to
the parent values, to try to set confidence limits and so forth.
This is the familiar situation of statistical inference where
we are trying to estimate from sample to parent.


6.15 In such a case we are effectively looking for a variate
transformation from x's to ζ's with the properties that the
new variates are independent and that they account for as
much variance as possible in descending order. Whether these
variates can be identified with anything real or with a structure
is not, at this stage, under examination. We may suppose,
as we wish, that the individuals of the population are characterized
by x's or by ζ's and the values which emerge for scrutiny
in the sample are selected at random. In other words,
we have to estimate the constants in


(6.17)


but the ζ's are variates.
6.16 Except in the degenerate case when the variation lies
in fewer than p dimensions, the parental characteristic
equation of type |R - λI| = 0 (where R now relates to
parental coefficients and λ to parental values) is the limiting
form of the sample equation. A simple appeal to continuity
then suggests that the sample values are consistent
estimators of the parent values.


Furthermore, the observed dispersion matrix in a normal
distribution, say V, is easily seen to be a maximum-likelihood
estimator of the parent dispersion matrix, say C; and individual
variances of the sample are maximum-likelihood estimators
of the parent values. We may then say that the sample
correlation matrix r, as obtained from the sample dispersion
matrix by division by the appropriate sample variances, is a
maximum-likelihood estimator of the parent R; and in this
sense the sample characteristic roots are the maximum-likelihood
estimators of the parent values.


There is a pitfall to be avoided here. If we make a
variate transformation

$$ \zeta_i = \sum_{j} l_{ij} x_j \qquad (6.18) $$

to independent ζ's, the new frequency function gives a
logarithm of the likelihood proportional to

$$ -n \sum_{i} \log \lambda_i - \sum_{i} \frac{1}{\lambda_i} \sum_{t=1}^{n} \zeta_{it}^2 \qquad (6.19) $$

$$ = -n \sum_{i} \log \lambda_i - n \sum_{i} \frac{1}{\lambda_i} \sum_{k,m} l_{ik} l_{im} r_{km}. \qquad (6.20) $$

If we maximize this for variations in the λ's we get

$$ \lambda_i = \sum_{k,m} l_{ik} l_{im} r_{km} \qquad (6.21) $$

but this is not a maximum-likelihood solution in the ordinary
sense. The likelihood, in fact, is a constant under the
rotation of (6.18). What we have done is to find the sample
principal components.
6.17 This case is accompanied by one curious feature, namely
that no non-zero λ requires any test of significance in this
sense: a sample with d effective dimensions cannot arise
from a population with fewer than d dimensions. However
small a value of λ may be, therefore, it cannot arise from
a parent value of zero. It may be negligible but it is
always significant.


Nevertheless there is a sense in which we can "test the
significance" of a set of λ's; or, better put, in which we
can test their distinguishability. When we extract a number
of components from a p-way complex there usually comes a
time when the remaining λ's, though not vanishing, are relatively
small. We may then ask ourselves: is the extraction
worth continuing? This means that we suspect the remaining
variation to be dimensionally isotropic: it could have arisen
from a population in which all remaining λ's are equal,
in which case there is no point in trying to maximize the
variation in any particular direction. We therefore seek
for a test of this property. Such an approximate test has
been derived by Bartlett and its nature will be clear from
one of his examples.


Example 6.1 
(Bartlett, Brit. J. Psych. Stat. Sect., 1950, 3, 77) 


Hotelling has previously considered some data by Kelley 
obtained with four tests of reading speed, reading power, 
arithmetic speed and arithmetic power, carried out on 140 
children. The correlation matrix was 


    1     .698     .264     .081
          1       -.061     .092
                   1        .594
                            1


The four roots of the characteristic equation were:

    λ_1    1.846
    λ_2    1.465
    λ_3    0.521
    λ_4    0.167
           -----
           3.999


and |R| = 0.2353.


A preliminary look at these results suggests that the 
first two components are meaningful; but there is a sharp 
drop from the second to the third and we are inclined to doubt 
the reality of the last two. Can we make this doubt into a 
precise hypothesis and test it? 


In virtue of what we have already said, the third and 
fourth components must be "significant" if there is no error 
in the observations. Their importance is judged in the first 
instance from the magnitude of the observed λ's.


We may then argue in this way: we refer to the distribution
of (6.6), that is to say the correlation distribution
in the case of parental independence. The entire structure
may be tested from (6.8) by testing |R|, or rather

$$ -\{ (n-1) - \tfrac16 (2p+5) \} \log |R| \qquad (6.22) $$

as χ² with ½ p(p-1) d.f.


Here |R| = 0.2353, p = 4 and n is 140. Actually, for
reasons which will appear presently, we take the factor of log
|R| in (6.6) as n - p - ½ = 135.5. We therefore have to test
-135.5 log .2353 = 196.0 with 6 d.f. This is highly
"significant".


The interpretation is that such values could not have
arisen from an uncorrelated parent. (Little is known of the
power of such a test, but an intuitive judgment would suggest
that its power is reasonably high against normally correlated
alternatives.) We therefore attribute "significance" to the
first component.


Now arises the problem of extracting the first component
and testing the residual; and so on. Bartlett proposes,
after k roots λ_1, ..., λ_k have been accepted, to test the remainder
by

$$ \chi^2 = -\{ n - 1 - \tfrac16 (2p+5) - \tfrac23 k \} \log R_{p-k} \qquad (6.23) $$

with ½ (p-k)(p-k-1) d.f., where

$$ R_{p-k} = \frac{|R|}{\lambda_1 \lambda_2 \cdots \lambda_k \left\{ \dfrac{p - \lambda_1 - \lambda_2 - \dots - \lambda_k}{p-k} \right\}^{p-k}} \qquad (6.24) $$

If n is large compared to p we can use the smallest multiplier,
corresponding to n - p - ½, for all k. This has the advantage that
the values of χ² are additive. The point of (6.24) is that
since |R| is the product of all the λ's, R_{p-k} is the product
of the remaining p - k, with a standardizing factor necessary
to bring the sample variances back to unity. The assumption
is that in "removing" the first k we are left with a p - k
complex of variation which we wish to test.


In the arithmetical example given,

    R_2 = 0.521 × 0.167 × (2/0.689)² = 0.7330
    R_3 = (0.2353 / 1.846) × (3/2.154)³ = 0.3444
    R_4 = |R| = 0.2353.


For the tests of the λ's we then have

    d.f.
     6     -135.5 (log .2353) = 196.0
     3     -135.5 (log .3444) = 144.4
     1     -135.5 (log .7330) =  42.1

All the values are "significant" and we conclude that the λ's
are effectively differentiated.
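
The arithmetic of Example 6.1 can be retraced with a short Python sketch (not part of the original notes). It assumes the reading of Bartlett's residual determinant used in the worked figures, namely |R| divided by the k accepted roots and multiplied by {(p - k)/(p - sum of accepted roots)} raised to the power p - k, and it uses the single multiplier n - p - ½ for every k, as the text does. The function name is hypothetical and natural logarithms are assumed.

```python
import math


def bartlett_residual_tests(roots, n):
    """Sequential residual tests on the latent roots of a correlation matrix.

    roots: latent roots in descending order (their sum is close to p).
    n: sample size.
    Returns a list of (k, R_residual, chi2, df) for k = 0, 1, ..., p - 2.
    """
    p = len(roots)
    det_r = math.prod(roots)              # |R| is the product of all roots
    multiplier = n - p - 0.5              # smallest multiplier, used for all k
    results = []
    for k in range(p - 1):
        accepted = roots[:k]
        remaining_sum = p - sum(accepted)
        residual = (det_r / math.prod(accepted)
                    * ((p - k) / remaining_sum) ** (p - k))
        chi2 = -multiplier * math.log(residual)
        df = (p - k) * (p - k - 1) // 2
        results.append((k, residual, chi2, df))
    return results


# Roots quoted in Example 6.1 (n = 140 children, p = 4 tests).
tests = bartlett_residual_tests([1.846, 1.465, 0.521, 0.167], n=140)
for k, residual, chi2, df in tests:
    print(k, round(residual, 4), round(chi2, 1), df)
```

Because the roots are quoted to three decimals only, the computed values agree with the tabulated 196.0, 144.4 and 42.1 to within rounding rather than exactly.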


6.18 The test requires a modification when the analysis into
components is carried out with standardization after evaluation
of the latent roots (Bartlett, 1951, Biometrika, 38,
337). In component analysis as we have expounded it we
standardize the variates at the outset by dividing the x's
by their respective sample variances. We could have proceeded
without standardization by ascertaining the latent
roots of |V - λI| = 0 where V is the dispersion matrix;
but this renders us dependent on the parent variances. Suppose
we assume them equal but still work on the observed dispersion
matrix. To standardize the values of λ we divide them
by their mean, which is the same as the mean observed variance;
the sum of the standardized λ's is then equal to p. In such a
case

$$ -\{ n - \tfrac16 (2p + 1 + 2/p) \} \log (\lambda_1 \lambda_2 \cdots \lambda_p) \qquad (6.25) $$

is distributed approximately as χ² with ½ (p + 2)(p - 1)
degrees of freedom.


6.19 We may remark in passing certain results in the sampling 
theory of canonical correlations. 


(a) Hotelling (1936) considered the large-sample theory
of canonical roots. On the basis of an underlying multivariate
population the correlation between two different roots
is zero and the variance of a canonical correlation r_i (the
positive root of λ_i²) can be tested as if it were an ordinary
product-moment r by the formula

$$ \operatorname{var} r_i = \frac{1}{n} (1 - \rho_i^2)^2 \qquad (6.26) $$

This suggests that Fisher's tanh⁻¹ transformation would be
useful in stabilizing the variance of canonical correlations
and in normalizing the distribution, but the point does not
seem to have been explored. We are assuming here, of course,
that the roots of the characteristic determinant are all
distinct.
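
The suggestion about Fisher's transformation can be made concrete with a small Python sketch (not part of the original notes). It assumes the large-sample variance (1 - ρ²)²/n for a canonical correlation, from which the usual first-order argument gives tanh⁻¹ r a variance of roughly 1/n, whatever ρ may be. Since the text itself records that the point has not been explored, this is only an illustration of the idea, not an established procedure; the function name and the normal multiplier 1.96 are assumptions.

```python
import math


def canonical_corr_limits(r, n, z_crit=1.96):
    """Rough large-sample limits for a canonical correlation via tanh^-1.

    Assumes var(atanh r) is about 1/n, which follows to first order from
    var r = (1 - rho^2)^2 / n because d(atanh r)/dr = 1 / (1 - r^2).
    """
    z = math.atanh(r)                    # variance-stabilized scale
    half_width = z_crit / math.sqrt(n)   # constant width on the z scale
    return math.tanh(z - half_width), math.tanh(z + half_width)


lower, upper = canonical_corr_limits(0.5, n=100)
print(lower, upper)
```

The limits are symmetric on the transformed scale and therefore asymmetric about r itself, more so as r approaches unity.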


(b) The distribution of the roots λ² of the determinant
(5.21) is of a form resembling that of the characteristic
roots of an equation, with frequency element proportional to

$$ \prod_{i=1}^{p} u_i^{\frac12 (q-p-1)} (1 - u_i)^{\frac12 (n-q-p-2)} \prod_{i<j} (u_i - u_j) \prod_{i=1}^{p} du_i \qquad (6.27) $$

where u_i = λ_i² and it is assumed that there is no real correlation
between the two groups of variates. In contradistinction
to the principal components case this distribution gives
a test of the reality (difference from zero) of the canonical
correlations, not merely their separability.


(c) Corresponding to the regression of a scalar variate
y on a set of x's we can regard canonical correlations or regressions
as part of a theory generalizing the ordinary univariate
theory. The natural generalization of the correlation
between a p-way and a q-way vector is the vector correlation
coefficient K = ρ_1 ρ_2 ... ρ_p, the product of the
canonical correlations; and the generalization of the alienation
coefficient 1 - ρ² of bivariate theory is the vector
alienation coefficient Z = (1 - ρ_1²)(1 - ρ_2²) ... (1 - ρ_p²). The
sampling theory of both K and Z is easier than that of any
one canonical correlation.
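
A minimal Python sketch of the two coefficients (not part of the original notes) follows. It takes the canonical correlations as given and forms K as their product and Z as the product of the quantities 1 - ρ²; the function name and the numerical values are hypothetical.

```python
import math


def vector_coefficients(canonical_correlations):
    """Return (K, Z): vector correlation and vector alienation coefficients."""
    k_coeff = math.prod(canonical_correlations)
    z_coeff = math.prod(1.0 - rho * rho for rho in canonical_correlations)
    return k_coeff, z_coeff


# Hypothetical canonical correlations, for illustration only.
k_coeff, z_coeff = vector_coefficients([0.8, 0.5])
print(k_coeff, z_coeff)
```

With a single canonical correlation the two coefficients reduce to ρ and 1 - ρ², the ordinary correlation and alienation coefficients of bivariate theory.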


Estimation in factor-analysis models 


Lawley's investigation


6.20 Lawley (Uppsala Symposium on Psychological Factor Analysis,
March 1953) has discussed the large-sample theory of estimation
in factor analysis, this work replacing some earlier
investigations on the subject. He begins with the model
(in my notation) as

$$ x_i = \sum_{k=1}^{m} a_{ik} \zeta_k + \varepsilon_i \qquad (6.28) $$

There are m < p factors and the term ε here is a residual
which may be either specific or an error of observation in
the x's. All we require of it is that it should be normally
distributed with zero mean and variance σ_i². We assume also
that the ratios of these variances are known (or, better still,
their actual values). We may then re-scale the observations
so that with new x's in (6.28) all the ε's may be taken to
have the same variance σ². Two cases then arise, according
as σ² is known or not known.


(The other case when the ratios of the error variances are
unknown remains open for inquiry. I shall not discuss it here.
There appears to be the same essential discontinuity between
the known and unknown variance-ratios situation as in the estimation
of constants in functional relationships considered in
chapter 4; and this is not surprising when we consider that
we can eliminate the ζ's from p equations of type (6.28) to
get a set of p-m equations in the x's subject to errors ε.)


6.21 Lawley then considers the likelihood function and derives 
some results which we will briefly indicate without detailed 
proof. 


(a) Let C be the population dispersion matrix and V the
sample matrix. Let a_k denote the (column) vector (a_1k, a_2k,
..., a_pk) and circumflex accents denote maximum-likelihood
estimators. Then in matrix notation

$$ V \hat a_k = \hat\lambda_k \hat a_k \qquad (6.29) $$

$$ \hat\lambda_k = \hat a_k' \hat a_k + \sigma^2 \qquad (6.30) $$

Thus we may take the m largest latent roots of V as estimators
of λ. We assume that n is so large that the latent roots are
distinguishable with negligible probability of being wrong.


(b) If, in addition, σ² also has to be estimated we have

$$ (p-m)\, \hat\sigma^2 = \operatorname{tr} V - \sum_{k=1}^{m} \hat\lambda_k \qquad (6.31) $$

or σ̂² is the mean of the p-m smallest roots.

(c) A criterion based on likelihood ratios may be derived
to determine whether m factors ζ are enough. The
criterion is

$$ n \{ \log ( |\hat C| / |V| ) + \operatorname{tr} (V \hat C^{-1}) - p \}, \qquad (6.32) $$

where Ĉ is C with the likelihood estimators inserted. This
is distributed approximately as χ² with ½ (p-m)(p-m+1)
degrees of freedom.


But if σ² is also estimated the criterion is

$$ n \log \{ |\hat C| / |V| \} \qquad (6.33) $$

with ½ (p-m-1)(p-m+2) degrees of freedom, one less than
before.


(d) The asymptotic variance of λ̂_k is

$$ \frac{2 \lambda_k^2}{n} \qquad (6.34) $$

and asymptotically two different λ̂'s are independent.


(e) The variance of σ̂² is given asymptotically by

$$ \operatorname{var} \hat\sigma^2 = \frac{2}{n} \, \frac{\sigma^4}{p-m} \qquad (6.35) $$


(f) If σ² is known


A rp 
cov z [c Qik Qir 
Ci ia AE gee 4 
aa A n (A, - 97) tj 2(A, - 0°) 
À Ny = OANA 
+3 t Oys A G waved (6.36) 


t7k A, - 0° (Ap - Ay)? 


where c_ij is an element of C.


Also we have
= Mads 


= Q, %4,. (6237) 
cov (Hyp, 55) ONT it “jr 
(g) If σ² is not known we have to add to the right-hand
side of (6.36) a term


4 


GS MEATA (6.38) 
en(p - m) (À, - 0°)? LEETS | 


and to the right-hand side of (6.37) a term 


o* 


an(p — m)(A, -= 07) (A, = 0?) 


hin tie + (6.39) 


6.22 These results make it possible to apply at least rough
tests in the factor-analysis case we are discussing. They
are sufficiently complicated but not unmanageably so; and
it is interesting to find expressions like λ_k - λ_t appearing
as denominators to mark the high sampling variability of
results when two latent roots become close together. For
small n, or for the case where the error variances have unknown
ratios, much remains to be done.


6.23 We should also mention a different model from that of
(6.28), discussed by Young (1941, Psychometrika, 6, 49) and
Whittle (1953, Skand. Akt. 35, 223).


We write

$$ x_{ij} = \sum_{k=1}^{m} a_{ik} \phi_{kj} + \varepsilon_{ij} \qquad (6.40) $$

where the ε's are still stochastic error terms but a's and
φ's are parameters under estimate. We now suppose there to
be an underlying structure connecting the variables. Apart
from the error term any x_ij is composed of a linear sum of
products of two components. φ_kj is a quantity varying from
individual to individual: we suppose that there are "factors"
which, for any variate i, appear in the individuals to different
extents. The weights a are independent of the individual
and measure the extent to which a factor appears in the ith
variate. No hypothesis concerning the distribution of φ is
required. As against this, we have more parameters to estimate.
In Lawley's case (6.28) there are, apart from error
variance, pm parameters a. In the Young-Whittle case there
are in addition mn φ's, making m(p + n) in all. There seems
to be no obvious reason a priori why an analysis of this
situation should bear any resemblance to the other.


6.24 It may be shown, however, that if we perform a least-squares
analysis, i.e. minimize absolutely

$$ \sum_{i} \sum_{j} \Bigl( x_{ij} - \sum_{k=1}^{m} a_{ik} \phi_{kj} \Bigr)^2 \qquad (6.41) $$

we do, on certain assumptions, arrive at the characteristic
equation and the usual solution for the λ's. The validity
of the least-squares approach depends on an assumption that
the error variances are all equal or that their ratios are
known and hence that the measurements have been standardized
so as to make the error variances equal. In this case the
tests of "significance" of the latent roots and the error
variances are different from those of Lawley's case.


I am bound to record my opinion, however, that the model 
represented by (6.40) is one which is not customarily required. 
And indeed, I do not understand its nature. Dr. Whittle and 
I have had a good deal of correspondence on the subject, at 
the end of which neither of us had succeeded in persuading 
the other to his point of view. The reader should be warned, 
therefore, that paragraphs 6.23 and 6.24 may inadequately pre- 
sent this part of the subject and should, if he is interested, 
pursue the topic in Dr. Whittle's papers. 


6.25 This branch of the subject needs a good deal more investigation
(a) to clear up confusions, (b) to obtain more effective
moderate-sample tests and (c) to deal with tests after
certain components have been extracted. Imperfect as may be
the light we can throw on the subject, however, it is dazzling
compared to the obscurity surrounding questions of significance
in analysis by methods other than principal components
such as centroid factor analyses.


6.26 The major question which has always exercised factor
analysts (in these cases where the exact structure was not
assumed beforehand) has been simply: when to stop factoring?
In psychological work there seems to have grown a divergence
of practice between British and American psychologists. The
former usually extract from two to four factors, the latter
more and sometimes up to a dozen. P. E. Vernon, after an exhaustive
discussion of sundry methods of deciding when to stop,
came to the conclusion that they gave such discrepant results
when applied to the same correlation matrices that it was doubtful
whether any of them effectively covered the conditions of
such analyses; and he went on to express a view that it may
never be possible to specify the sampling errors of centroid
and similar techniques precisely. It would be rash to suppose
that no further progress is possible, if only by sampling experiments
with modern computing equipment; but it certainly
seems that the problem will not yield to a direct frontal approach
of the classical statistical kind. Perhaps the most
useful work which could be done in this field would be an investigation
into the actual distribution function of (say)
the four largest roots of the characteristic determinant on
certain simple non-null hypotheses about the parent. Asymptotic
results are hardly sufficient for the sample sizes which
are customarily met with in practice.


7. NOTES ON THE HISTORY OF MULTIVARIATE ANALYSIS 


7.1 The remaining chapters 8 and 9 of these notes will take
us on to ground which, in a sense, is more familiar. It
largely consists of generalizations to the multivariate case
of the known results for univariate cases: analysis of dispersion,
regression, tests of hypotheses and so forth. Not
all these generalizations are straightforward, and some of the
multivariate theory, e.g. that relating to discriminant functions,
is of a new type. But broadly speaking few new statistical
ideas emerge. The main difference between uni- and
multivariate analysis in such branches lies in the increasing
complexity of the mathematics and the increasing difficulty
of interpreting the results.


7.2 The purpose of this chapter is to give a brief historical
account of the development of the subject over the past thirty
years. This is one way of getting an insight into the intricacies
of the work. I shall, for convenience, divide the notes
under three headings (a) Wilks' criterion, (b) Discrimination
and (c) Latent roots. The three topics are of course interrelated
and sometimes an individual was writing on all three
subjects at the same time; but the segmentation has reason as
well as convenience to justify it.


7.3 Multivariate sampling-distribution theory may be regarded
as beginning with the publication by Wishart in 1928 of the
distribution known by his name. This gave for p normal
variates the distribution of the variances and covariances
which previously had been found by Fisher in 1917 for the bivariate
distribution. It was latent in Wishart's work, and
was brought out explicitly by Wilks in 1932, that the dispersion
matrix (a_ij) was the natural extension to the multivariate case
of the variance in univariate theory. Consequently one line
of development has been the study of ratios of matrices of
the dispersion type, this being the multivariate analogue of
the family of results (Student's t, analysis of variance, regression
tests, etc.) which in univariate theory lead up to a
test based on a variance ratio. A second line has been the
study of differences of matrices of type A - λB, or of maximization
under constraint which, as we noted in chapter 6,
lead to the study of determinantal roots. A third line has
been the measurement of distance between p-variate populations
and the associated study of discriminant functions. No one
of these topics has been pushed as far as it can be or will be,
but in all three theory is in some danger of outstripping
practice; and one would expect that a point has been reached
where many of the results already attained need further study
in order that they may be reduced to the possibility of numerical
application.


Note on the history of Wilks' criterion 


7.4 Wishart's distribution has itself been generalized. T.
W. Anderson and Girshick (1944) and Anderson alone (1946) have
considered the 'non-central' distribution, that is to say the
distribution of the sum of squares and cross products of a
p-variate sample taken about some (fixed) point other than the
sample means. The distributions are, naturally, rather complicated
but there seems no reason why they should not be
brought to numerical application if the labour were considered
worth while.


7.5 One of the basic papers on determinantal ratios is that
of Wilks (1932). Among other things Wilks found the distribution
of the ratios of two sample dispersion determinants
based on independent samples from identical populations (the
analogue of the F-ratio); explicitly for p = 2 and in the
form of an integral for greater values of p. By an ingenious
argument of wide application he evaluated the moments in the
general case. He then proceeded to study the distribution
of the ratio of a dispersion determinant to one of its principal
minors.


7.6 Approaching the subject from the viewpoint of the correlation
ratio Wilks then dealt with what would nowadays be regarded
as an analysis of variance between and within classes
in a one-way classification. Suppose we have k independent
samples from the same p-variate population. Let (a_ij) be
the dispersion matrix of all samples together and b_ij the
function of means defined by

$$ b_{ij} = \frac{1}{n} \sum_{\alpha=1}^{k} n_\alpha (\bar x_{i\alpha} - \bar x_i)(\bar x_{j\alpha} - \bar x_j) \qquad (7.1) $$

where n_α is the number in the αth sample, n = Σ n_α, x̄_iα is the
mean of the ith variate in the αth sample and x̄_i is the mean
of the ith variate in all samples together. Let

$$ c_{ij} = a_{ij} - b_{ij} \qquad (7.2) $$

so that (c_ij) is a "within-class" dispersion matrix. Then the
ratio

$$ W = \frac{|c_{ij}|}{|a_{ij}|} \qquad (7.3) $$

$$ = \frac{|c_{ij}|}{|b_{ij} + c_{ij}|} \qquad (7.3a) $$

is the natural extension of the ratio of "within-class" to
"total" variance. The ratio arises more naturally in this
form than as a "between-class" to "within-class" ratio
|b_ij| / |c_ij|.


7.7 Wilks derived the explicit distribution of the ratio W
(now generally known as Wilks' criterion) for p = 1 and p = 2,
and the moments of the distribution for general p, in the case
where the criterion arises as a likelihood ratio for testing
the homogeneity of a set of means in samples from populations
with identical dispersions. Finally he went on to consider
the ratios of certain correlation determinants (as distinct
from covariance determinants) and derived their moments. Later,
in 1935, he used similar ideas and methods to test the independence
of k sets of normal variates.
7.8 Lawley (1938) also discussed the generalization of the
F-ratio and proposed a test-function of a different kind from
Wilks', namely

$$ V = \sum_{i,j} c^{ij} b_{ij} \qquad (7.4) $$

where (c^{ij}) is the inverse of (c_ij). Shortly afterwards the
theory of characteristic or latent roots began its development
and almost at once Hsu (1940) linked it up with the Wilks
and Lawley criteria. In fact, if the non-zero roots of

$$ | b_{ij} - \lambda (b_{ij} + c_{ij}) | = 0 \qquad (7.5) $$

are λ_1, ..., λ_m, then

$$ W = \prod_{i=1}^{m} (1 - \lambda_i) \qquad (7.6) $$

$$ V = \sum_{i=1}^{m} \frac{\lambda_i}{1 - \lambda_i} \qquad (7.7) $$

It is the dependence of the Wilks-Lawley criteria on symmetric
functions of the λ's and not on individual λ's which makes
their sampling properties relatively easy to investigate. In
a further paper (1941d) Hsu showed how the general regression
problem could be reduced to a canonical form and that significance
tests depend essentially on the distribution of W. Since
Hotelling's T also depends on W this was equivalent to demonstrating
an extension of Student's t to the testing of multivariate
regression coefficients.
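
The dependence of both criteria on the roots alone is easily shown in a Python sketch (not part of the original notes). It assumes the readings of (7.6) and (7.7) in which W is the product of the quantities 1 - λ and V is the sum of the quantities λ/(1 - λ), taken over the non-zero roots of the between-to-total determinantal equation; the function name and the roots are hypothetical.

```python
import math


def wilks_and_lawley(roots):
    """Return (W, V) from the non-zero roots of |b - lambda (b + c)| = 0.

    W = product of (1 - lambda)          (Wilks' criterion)
    V = sum of lambda / (1 - lambda)     (Lawley's test-function)
    Each root must lie strictly between 0 and 1.
    """
    w_crit = math.prod(1.0 - lam for lam in roots)
    v_crit = sum(lam / (1.0 - lam) for lam in roots)
    return w_crit, v_crit


# Hypothetical roots, for illustration only.
w_crit, v_crit = wilks_and_lawley([0.5, 0.2])
print(w_crit, v_crit)
```

Both quantities are symmetric functions of the roots, so reordering the roots leaves them unchanged, which is the property the text singles out.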


7.9 In the meantime Bartlett (1938) gave the first of a useful
series of approximations to the distribution of W in the
null case by fitting a χ² distribution to the lower moments.
Hsu (1940) showed that nV and -n log W tend to be certainly
identical for large n and that both have a distribution which
can be expressed as a non-central χ². Wald and Brookner
(1941) also gave an expression for the distribution of -log
W and made a beginning with the tabulation of percentage points
in special cases.


7.10 Progress was being made rapidly at this time. It was
checked by the war but not for long. In 1946 Wilks extended
his earlier work to the testing of equality of means, variances
or covariances in a multivariate normal system. He and Tukey
(1946) approximated to the distribution by fitting a Beta distribution
(not, as in Bartlett's case, a Gamma distribution)
to the lower moments and Wilks tabulated some of the percentage
points. In the same year (1946) T. W. Anderson examined the
distribution of the ratio W when the dispersion elements are
distributed in a non-central Wishart form and obtained the
moments for p = 1 and 2. (There seems on many occasions to be
a natural barrier at p = 2 preventing extensions to higher p
without the importation of more complicated transcendental
functions.) Rao (1948b) later examined Bartlett's approximation.


7.11 The problem of nuisance parameters has also received some 
attention. Plackett (1947), following some ingenious work by 
Pitman and Morgan in devising tests of equality of variances in 
bivariate populations independent of parental covariances, de- 
rived an exact test for the equality of variances or covariances 
in any number of uni- or bi-variate populations and gave extensions
to pairs of three- or four-variate populations subject
to certain conditions on the sample size. Recently G. S.
James (1954) has considered the extension of Welch's work in 
testing the difference of means where variances are unknown. 


A historical note on latent roots 


7.12 We have noted how criteria based on determinantal ratios
depend on symmetric functions of latent roots of certain
matrices. Occasions often arise in which we want the distribution
of one of the roots, which raises some new and complex
distributional problems. The necessity for this knowledge
arises in two main ways:


(a) in the reduction of multivariate situations to a
'standard' or 'canonical' form, as for instance in component
and canonical correlation analysis;


(b) in testing for the equality of two matrices, as for
instance by examining whether a root of |A - λB| = 0 is
near unity. This is the approach which has been developed
by S. N. Roy since 1939.


7.13 Let us note the formal equivalence of certain customary
ways in which the problem arises. If (b) and (c) are matrices
of the dispersion type independently distributed in the Wishart
form the canonical correlations are, in effect, the roots
of

$$ | \lambda (b_{ij} + c_{ij}) - b_{ij} | = 0. \qquad (7.8) $$

We can, of course, write

$$ \mu = \frac{\lambda - 1}{\lambda} $$

and derive (apart from zero roots)

$$ | c_{ij} + \mu b_{ij} | = 0. \qquad (7.9) $$

If we change the sign of μ and regard the sample number associated
with b_ij as tending to infinity; and if we let the variances
of b be unity and the covariances zero we get

$$ | c_{ij} - \mu I | = 0. \qquad (7.10) $$

And, subject to the element of approximation induced by standardizing
sample values of c_ij so as to have unit variances
we then derive

$$ | r_{ij} - \mu I | = 0. \qquad (7.11) $$


7.14 The large-sample theory of canonical correlations was 
worked out by Hotelling (1936). There does not seem to have 
been much more done in this field until the recent work by 
Lawley (referred to in chapter 6) on factor analysis. It
would be useful to have this topic explored further. 


7.15 Attention has been concentrated more on exact distributions
in the case of normal variates and, as might be expected,
the greatest progress has been made in the null case, i.e. where
the p variates are independent. The basic distribution here
is the Fisher-Hsu-Roy distribution given for a special case
in chapter 6. This gives the simultaneous distribution of
the roots but for p greater than 2 it is a difficult distribution
to handle because of the complicated range of variation
of the successive λ's, which renders almost impossible the
integrating out of unwanted roots. The greatest progress here
has been made by Roy (1945) and Nanda (1948) in dealing with
the case of (7.9) where the (a) and (b) matrices emanate from
identical populations.


7.16 For the non-null case expressions have been derived by
Bartlett (1947) giving the distribution of canonical correlations
in the form of an infinite series rather like Fisher's
series for multiple R in the non-null case. (Hsu, 1941a, had
obtained the limiting form of the distribution.) The distribution
is tractable when there is only one non-vanishing parent
canonical correlation, but is stubborn for more than one
non-vanishing parental correlation.


7.17 Little attention seems to have been given to the distribution
of the vectors corresponding to the latent roots but
reference may be made to T. W. Anderson (1951b) on this subject.


A historical note on discriminatory analysis 


7.18 Doubtless the idea of discriminating between multivariate
populations could be traced far back into the past. For present
purposes the history of the subject may be regarded as
beginning with the work of Karl Pearson round about 1920.
Pearson, considering anthropometric data, was led to seek a
coefficient which would, in some acceptable sense, "measure
the distance" between two populations. The first work to be
published on his coefficient of racial likeness, which he denoted
by C², was Miss Tildesley's on Burmese skulls (1921,
Biometrika, 13, 247).


7.19 About the same time P. C. Mahalanobis developed an interest
in the subject and came to the conclusion that Pearson
had not achieved his object. C² varied very much with the
sample number and, although it provided a test of significance,
it did not measure the magnitude of the difference between
two populations. Mahalanobis accordingly proposed an alternative
measure which he called D² and used it in 1925 to discuss
racial mixtures in Bengal. This far-sighted work was
the starting point of research by the Indian school which is
still in progress.


7.20 If two multivariate normal populations have the same dispersion
matrix $(\alpha_{ij})$, with an inverse $(\alpha^{ij})$, and means
$\mu_{i1}$, $\mu_{i2}$ ($i = 1, 2, \ldots, p$), the distance between
means of variates may be defined as $\delta_i = \mu_{i1} - \mu_{i2}$
and Mahalanobis' generalized distance for the population may be written

$$\Delta^2 = \sum_{i,j=1}^{p} \alpha^{ij}\,\delta_i\,\delta_j. \qquad (7.12)$$

A corresponding formula holds for the sample values:

$$D^2 = \sum_{i,j=1}^{p} a^{ij}\,d_i\,d_j, \qquad (7.13)$$

where $d_i = \bar{x}_{i1} - \bar{x}_{i2}$ and $(a^{ij})$ is the inverse
of the sample dispersion matrix.
If the variates are independent (7.12) reduces to

$$\Delta^2 = \sum_{i=1}^{p} \frac{\delta_i^2}{\operatorname{var} x_i}, \qquad (7.14)$$

and if we scale the variates so as to have unit variances this
becomes the square of the 'distance' between the parent means
in the customary sense. We may also regard (7.12) as defining
an ordinary distance in a Euclidean space with oblique
axes. It follows that, given three populations P, Q and R,
the distance between P and Q is not greater than the sum of
distances P to R and R to Q.
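
As a rough numerical illustration of (7.12) and (7.14), the following minimal sketch assumes NumPy is available. The function name `mahalanobis_sq` and the two-variate figures are hypothetical and are not taken from the text.

```python
import math
import numpy as np

def mahalanobis_sq(mean1, mean2, dispersion):
    """Squared generalized distance: sum over i, j of a^{ij} d_i d_j,
    where (a^{ij}) is the inverse of the common dispersion matrix."""
    d = np.asarray(mean1, dtype=float) - np.asarray(mean2, dtype=float)
    # Solve the linear system rather than forming the inverse explicitly.
    return float(d @ np.linalg.solve(np.asarray(dispersion, dtype=float), d))

# Hypothetical illustration: unit variances, correlation 0.5.
correlated = np.array([[1.0, 0.5], [0.5, 1.0]])
P, Q, R = [0.0, 0.0], [1.0, 2.0], [3.0, -1.0]

d2_corr = mahalanobis_sq(Q, P, correlated)   # oblique axes
d2_indep = mahalanobis_sq(Q, P, np.eye(2))   # independent, unit variances

def dist(a, b):
    """Distance (not squared) under the correlated dispersion matrix."""
    return math.sqrt(mahalanobis_sq(a, b, correlated))

print(d2_corr, d2_indep)
print(dist(P, Q) <= dist(P, R) + dist(R, Q))  # triangle inequality
```

With independent variates of unit variance the result is the ordinary squared Euclidean distance, while the correlated case gives a different value for the same pair of means.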


7.21 It is to be emphasized that the definition applies only
to normal populations with identical dispersion matrices. In
the language of differential geometry, it is based on a
Euclidean metric, not a Riemannian metric. In other cases
it may be unsuitable.


7.22 Between 1927 and 1930 Mahalanobis had some controversy
with K. Pearson, who defended $C^2$. Mahalanobis (1930, Jour.
Asiatic Soc. Bengal, 26, 541) continued his practical and
theoretical research on $D^2$. Pearson, though writing on his
coefficient as late as 1936 just before his death, failed to
find much support for his views.


7.23 At this point Hotelling (1931) generalized "Student's" t.
His T was, in fact, equivalent to Mahalanobis' $D^2$, for


2 
D? n, n, 


lie (7.15) 


(n, +n.) (ny witty: 1) 


where $n_1$ and $n_2$ are the two sample numbers. It was some time,
however, before this equivalence was realized, or before it was
pointed out that in the null case the distribution of both was
equivalent to the distribution of the multiple correlation
coefficient $R^2$ in the null case. In the meantime Fisher (1930)
obtained the distribution of $R^2$ in the non-null case.


7.24 There appears to have been a lull in this particular field
of statistical theory for five or six years during which the
theory of testing hypotheses, beginning in 1928, was developed
by Neyman and E. S. Pearson. In 1936 Fisher published his
first paper on discriminant functions and resolved the problem
of testing significance. The main difference between his approach
and that of Mahalanobis was that the latter was measuring
distance whereas Fisher was merely concerned to divide the sample
space into two regions and allocate a sample value to one
population or another according to which region it fell into;
to that extent his approach was nearer to the modern theory of
decision functions. But again Mahalanobis' $D^2$ appeared in a
natural role and the approaches were obviously very close.


7.25 The distribution of $D^2$ in the non-null case when both
populations have a known dispersion matrix was found by R. C.
Bose in 1936. Two years later he and Roy "Studentized" the
distribution. Fisher (1938) about the same time gave what is
effectively a generalization of $D^2$ for more than two populations.
And Welch (1939) linked up the theory of discriminatory
functions and that of statistical tests in a simple
but effective way.


7.26 There is here a break in continuity which is not entirely
due to war. The English school on the whole moved over to
the study of latent roots and canonical correlations or
concerned themselves with the practical problems of applying
discriminant functions (Penrose, 1947 and C. A. B. Smith, 1947).
Similar studies began in the U.S.A., as for example with the
work of Cochran (1943), von Mises (1945), Cochran and Bliss
(1948) and T. W. Anderson (1951 a). Further theoretical
developments are due to Roy and Rao; see in particular Rao
(1946-1948 a, 1948 c, 1949, 1950).


7.27 Rao (1948 c) made some interesting comments on the general
problem of measuring distance which do not appear to have been
followed up, possibly because tensor analysis and differential
geometry are not very familiar fields to most mathematical
statisticians.

Suppose we have a multivariate population depending on $u$
parameters $\theta_1, \ldots, \theta_u$, with density function
$\phi(x, \theta_1, \ldots, \theta_u)$, where for short we write $x$
to denote all the variables. We may set up a parameter space of
$u$ dimensions and consider two neighbouring points in it typified
by $\theta$ and $\theta + d\theta$. The difference between the
probability densities of the two points may be written $d\phi$,
where $d$ relates to variations in $\theta$. Let us now agree to
measure the discrepancy between the populations represented by
the two points by their relative difference $d\phi/\phi$. The
variance of this quantity seems a reasonable measure of
discrepancy and becomes

$$ds^2 = \sum_{i,j=1}^{u} g_{ij}\,d\theta_i\,d\theta_j, \qquad (7.16)$$

where

$$g_{ij} = E\left(\frac{1}{\phi}\frac{\partial\phi}{\partial\theta_i}\cdot\frac{1}{\phi}\frac{\partial\phi}{\partial\theta_j}\right), \qquad (7.17)$$

namely, what is usually called the information matrix. The
form $ds^2$ is invariant and the $g$'s are consequently a covariant
tensor of the second order. We may go further and regard
$ds^2$ as the quadratic differential metric defining the
element of length.


7.28 Given, then, two points A and B (not necessarily close
together) in the parameter space we can find the distance
between them by integrating $ds$ along a geodesic. If the
equations of this geodesic are

$$\theta_i = f_i(t), \qquad i = 1, 2, \ldots, u, \qquad (7.18)$$

where $t$ is a variable, the functions $\theta_i$ are solutions of the
equations

$$\sum_{j=1}^{u} g_{jk}\,\frac{d^2\theta_j}{dt^2} + \sum_{j,l=1}^{u} [jl, k]\,\frac{d\theta_j}{dt}\,\frac{d\theta_l}{dt} = 0, \qquad k = 1, 2, \ldots, u, \qquad (7.19)$$

where $[jl, k]$ is the Christoffel symbol defined by

$$[jl, k] = \frac{1}{2}\left(\frac{\partial g_{jk}}{\partial\theta_l} + \frac{\partial g_{lk}}{\partial\theta_j} - \frac{\partial g_{jl}}{\partial\theta_k}\right). \qquad (7.20)$$

Theoretically the $g$'s are determinable from (7.17) and hence the
$\theta$'s by the solution of these equations. We can then find
the distance by integrating $ds$ from A to B along the curve
defined by (7.18).

This would be an attractive measure of distance, being
invariant under very general transformations of the parameters,
but the estimation of the distance raises some difficulties.
Mahalanobis' distance is a particular case, for the geodesics
then become straight lines.
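
The particular case can be worked through in a few lines. The following is a sketch under the assumption of a $p$-variate normal family whose dispersion matrix $(\alpha_{ij})$ is known and common, so that the only parameters are the means $\mu_i$; the symbols $\mu^{(1)}$, $\mu^{(2)}$ for the two sets of means are introduced here for illustration only.

```latex
% Information matrix for the means of a normal family with known dispersion:
g_{ij} = E\left\{\frac{\partial \log\phi}{\partial\mu_i}\,
                 \frac{\partial \log\phi}{\partial\mu_j}\right\}
       = \alpha^{ij},
\qquad
ds^2 = \sum_{i,j=1}^{p} \alpha^{ij}\, d\mu_i\, d\mu_j .

% The g's are constants, so every Christoffel symbol vanishes and the
% geodesic equations reduce to d^2\mu_j/dt^2 = 0, i.e. straight lines:
\mu_i(t) = \mu_i^{(1)} + t\,\delta_i, \qquad
\delta_i = \mu_i^{(2)} - \mu_i^{(1)}, \qquad 0 \le t \le 1 .

% Integrating ds along this line gives the generalized distance:
\int_0^1 \Bigl(\sum_{i,j} \alpha^{ij}\,\delta_i\,\delta_j\Bigr)^{1/2} dt
  = \Bigl(\sum_{i,j} \alpha^{ij}\,\delta_i\,\delta_j\Bigr)^{1/2} = \Delta .
```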


7.29 Bhattacharyya (1946) also suggested representing
populations on a unit hypersphere, the distance between
them then being the angle

$$\operatorname{arc\,cos} \int \{f(x)\,\psi(x)\}^{1/2}\,dx, \qquad (7.21)$$

where $f$, $\psi$ are the frequency functions of the two populations.
This approach and that of the previous paragraph are both
free from any restriction as to the shape of the frequency
function or the number of parameters involved.
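
The angle (7.21) is easy to approximate numerically. The sketch below is only an illustration: the two univariate normal populations, the integration limits and the helper names are hypothetical choices, and the integral is evaluated by a plain trapezoidal rule. For two normal populations with equal variance the integral has the known closed form exp{-(difference of means)^2 / (8 variance)}, which serves as a check.

```python
import math

def normal_pdf(x, mean, sd):
    """Frequency function of a univariate normal population."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def bhattacharyya_angle(f, g, lo, hi, steps=20000):
    """arc cos of the integral of sqrt(f * g) over [lo, hi] (trapezoidal rule)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.sqrt(f(x) * g(x))
    affinity = min(1.0, total * h)  # guard against rounding slightly above 1
    return math.acos(affinity)

# Hypothetical example: N(0, 1) against N(1, 1).
angle = bhattacharyya_angle(lambda x: normal_pdf(x, 0.0, 1.0),
                            lambda x: normal_pdf(x, 1.0, 1.0),
                            -12.0, 13.0)
closed_form = math.acos(math.exp(-1.0 / 8.0))
print(angle, closed_form)
```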




8. TESTS OF HOMOGENEITY 


8.1 Throughout this chapter we shall suppose that the variation
is normal. We shall consider samples from k different
p-variate populations and three types of hypothesis:

$H$ : That the populations have the same means, the same
variances and the same covariances;

$H_1$ : That the populations have the same variances and
the same covariances, irrespective of the means;

$H_2$ : That, given the equality of variances and covariances,
the means are equal.

The results are a natural extension of homogeneity tests for k
univariate normal samples based on likelihood ratios (for which
see Kendall, Advanced Theory of Statistics, vol 2). The
hypothesis which generalizes the ordinary analysis of variance
is $H_2$ which, as a general rule, leads to criteria with
simpler distribution functions than those of $H$ and $H_1$.


The Pearson-Wilks results for bivariate populations 


8.2 We consider first the case of k bivariate populations
($p = 2$). The $t$th population has means $\mu_t$ and $\nu_t$ for
variates $x$ and $y$, and the dispersions are $\sigma_{xt}^2$,
$\sigma_{yt}^2$ and $\rho_t\sigma_{xt}\sigma_{yt}$. We draw a sample of
$n_t$ members from it, the total being $n$. Consider the hypothesis
$H$. We ascertain two likelihoods, $P(\Omega)$, the maximum likelihood
when all the parameters are different, and $P(\omega)$, the maximum
likelihood under the restrictions of $H$ that means and dispersions
are common to the populations.


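If the likelihood for the first case is written down and
differentiated with respect to the various parameters we find

$$\hat\mu_t = \bar{x}_t, \qquad \hat\nu_t = \bar{y}_t, \qquad (8.1)$$

$$\hat\sigma_{xt} = s_{xt}, \qquad \hat\sigma_{yt} = s_{yt}, \qquad \hat\rho_t = r_t, \qquad (8.2)$$

where $\bar{x}_t$ and $\bar{y}_t$ are sample means and $s_{xt}$, $s_{yt}$,
$r_t$ the sample standard deviations and correlation. (In arriving at
the standard deviations we use divisors of type $n_t$, not $n_t - 1$.)

Similarly, for the second case we find

$$\hat\mu = \bar{x}_0, \qquad \hat\nu = \bar{y}_0, \qquad (8.3)$$

$$\hat\sigma_x^2 = v_{110} = v_{11a} + v_{11m}, \qquad (8.4)$$

$$\hat\sigma_y^2 = v_{220} = v_{22a} + v_{22m}, \qquad (8.5)$$

$$\hat\rho\,\hat\sigma_x\,\hat\sigma_y = v_{120} = v_{12a} + v_{12m}, \qquad (8.6)$$

where

$$\bar{x}_0 = \frac{1}{n}\sum_{t=1}^{k} n_t\,\bar{x}_t, \qquad (8.7)$$

$$\bar{y}_0 = \frac{1}{n}\sum_{t=1}^{k} n_t\,\bar{y}_t, \qquad (8.8)$$

$$v_{110} = \frac{1}{n}\sum_{t=1}^{k}\sum_{\alpha=1}^{n_t} (x_{\alpha t} - \bar{x}_0)^2 = s_{x0}^2, \text{ say}, \qquad (8.9)$$

$$v_{220} = \frac{1}{n}\sum_{t=1}^{k}\sum_{\alpha=1}^{n_t} (y_{\alpha t} - \bar{y}_0)^2 = s_{y0}^2, \text{ say}, \qquad (8.10)$$

$$v_{120} = \frac{1}{n}\sum_{t=1}^{k}\sum_{\alpha=1}^{n_t} (x_{\alpha t} - \bar{x}_0)(y_{\alpha t} - \bar{y}_0) = r_0\,s_{x0}\,s_{y0}, \text{ say}. \qquad (8.11)$$

Thus $s_{x0}^2$, $s_{y0}^2$ and $r_0$ are the variances and correlation of the
pooled samples. Further

$$n\,v_{11a} = \sum_{t=1}^{k}\sum_{\alpha=1}^{n_t} (x_{\alpha t} - \bar{x}_t)^2, \qquad (8.12)$$

$$n\,v_{11m} = \sum_{t=1}^{k} n_t\,(\bar{x}_t - \bar{x}_0)^2, \qquad (8.13)$$

with corresponding expressions for $v_{12a}$, $v_{22a}$ etc. We also
write for each sample

$$n_t\,v_{11t} = \sum_{\alpha=1}^{n_t} (x_{\alpha t} - \bar{x}_t)^2 = n_t\,s_{xt}^2, \qquad (8.14)$$

$$n_t\,v_{22t} = \sum_{\alpha=1}^{n_t} (y_{\alpha t} - \bar{y}_t)^2 = n_t\,s_{yt}^2, \qquad (8.15)$$

$$n_t\,v_{12t} = \sum_{\alpha=1}^{n_t} (x_{\alpha t} - \bar{x}_t)(y_{\alpha t} - \bar{y}_t) = n_t\,s_{xt}\,s_{yt}\,r_t. \qquad (8.16)$$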

If we substitute the appropriate values in the original
likelihoods we find

$$\lambda_H = \frac{P(\omega)}{P(\Omega)} = \prod_{t=1}^{k} \left\{\frac{|v_{ijt}|}{|v_{ij0}|}\right\}^{n_t/2}, \qquad (8.17)$$

where

$$|v_{ijt}| = \begin{vmatrix} v_{11t} & v_{12t} \\ v_{12t} & v_{22t} \end{vmatrix} = s_{xt}^2\,s_{yt}^2\,(1 - r_t^2), \qquad (8.18)$$

$$|v_{ij0}| = \begin{vmatrix} v_{110} & v_{120} \\ v_{120} & v_{220} \end{vmatrix} = s_{x0}^2\,s_{y0}^2\,(1 - r_0^2). \qquad (8.19)$$

These are, in fact, the generalized variances of the tth
sample and of all samples together, and the likelihood ratio
thus appears as a product of Wilks criteria.


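8.3 In an analogous way we find

$$\lambda_{H_1} = \prod_{t=1}^{k} \left\{\frac{|v_{ijt}|}{|v_{ija}|}\right\}^{n_t/2}, \qquad (8.20)$$

$$\lambda_{H_2} = \left\{\frac{|v_{ija}|}{|v_{ij0}|}\right\}^{n/2}, \qquad (8.21)$$

and we note that, as in the univariate case,

$$\lambda_H = \lambda_{H_1} \times \lambda_{H_2}. \qquad (8.22)$$

We may conveniently indicate by suffixes 1 and 2 the criteria
appropriate to $H_1$ and $H_2$.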


8.4 Our next problem is to find the sampling distributions of
these criteria. This is performed by Wilks' method (we shall
omit the details) of finding the moments of the $\lambda$'s, or rather,
of $L = \lambda^{2/n}$, which will serve equally well as a criterion
because it is monotonically dependent on $\lambda$. The distribution
of $L_2$ is fairly simple, being given by

$$dF = \frac{\Gamma(n-2)}{\Gamma(n-k-1)\,\Gamma(k-1)}\,(\sqrt{L_2})^{\,n-k-2}\,(1-\sqrt{L_2})^{\,k-2}\,d\sqrt{L_2}. \qquad (8.23)$$

The distributions of $L_1$ and $L$ (relating to $H$) are more
complicated except when $k = 2$ and approximations have to be
employed.


8.5 We may also note that the L-criteria have certain properties
to recommend them on intuitive grounds. They vary
from 0 to 1 and as they decrease from unity we are more inclined
to reject the corresponding hypothesis, i.e. small
values are significant. Consider, for example,

$$(\lambda_{H_2})^{2/n} = \frac{|v_{ija}|}{|v_{ija} + v_{ijm}|}. \qquad (8.24)$$

For this to be unity $(v_{ijm})$ must be zero and all the sample means
are the same. As they move further apart $\lambda_{H_2}$ decreases; it
is zero if and only if one of the differences of means is
infinite or if within-class variances are zero or $r_t = 1$ for any
t. Furthermore it decreases monotonically, that is to say,
has no other maximum apart from unity.


Example 8.1 
(Pearson and Wilks, 1933) 


Five samples are available, each of twelve members, of
aluminium die-castings ($k = 5$, $n_t = 12$ for all $t$, $n = 60$).
On each specimen two measurements are taken: tensile strength
(1000 lb. per square inch) which we call x, and hardness
(Rockwell's E) which we call y. The data may be summarized as
follows:

Sample            x                        y               Correlation
number    Mean     Standard        Mean     Standard      coefficient
  t                deviation                deviation          r
  1      33.399     2.565          68.49     10.19           0.683
  2      28.216     4.318          68.02     14.49           0.876
  3      30.313     2.188          66.57     10.17           0.714
  4      33.150     3.954          76.12     11.18           0.715
  5      34.269     2.715          69.92      9.88           0.805

We are interested in the homogeneity of these data.

We first of all test $H_1$, that the data show no significant
difference in dispersions. We have the following results


            Sums of squares           Sum of        Generalized
   t                                  products      variances      $\log_{10}|v_{ijt}|$
         $n_t v_{11t}$   $n_t v_{22t}$    $n_t v_{12t}$      $|v_{ijt}|$
   1        78.948      1247.18       214.18        365.204        2.56254
   2       223.695      2519.31       657.62        910.401        2.95923
   3        57.448      1241.78       190.63        243.029        2.38566
   4       187.618      1473.44       375.91        938.451        2.97241
   5        88.456      1171.73       259.18        253.281        2.40360
Totals     636.165      7653.44      1697.52                      13.28344

Hence we find

$v_{11a} = \frac{1}{60}(636.165) = 10.6028$

$v_{22a} = 127.5573$

$v_{12a} = 28.2920$

$|v_{ija}| = 552.018$

$\log_{10} L_1 = \frac{1}{k}\log_{10}\prod_t |v_{ijt}| - \log_{10}|v_{ija}| = \bar{1}.914734$

giving $L_1 = 0.8217$.

We test, generally, $H$ with $5(k - 1)$ degrees of freedom,
$H_1$ with $3(k - 1)$ and $H_2$ with $2(k - 1)$. There is a useful
general theorem (due to Wilks) of which a particular case is
that for large samples $-n \log_e L_1$ is distributed as $\chi^2$ with a
number of degrees of freedom equal to the number of constants
fitted in $\Omega$ less the number fitted in $\omega$, in this case
$3(5 - 1) = 12$. This gives 11.78 as $\chi^2$ with 12 d.f. which is not
significant. A more exact test confirms this.
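
The arithmetic of this test of $H_1$ can be retraced from the table of sums of squares and products. The following is a minimal sketch using only the figures printed above; the helper name `gen_var` is a hypothetical choice, and the simplified form of $\log L_1$ relies on all $n_t$ being equal.

```python
import math

# (n_t*v_11t, n_t*v_22t, n_t*v_12t) for the five samples of Example 8.1.
sums = [
    (78.948, 1247.18, 214.18),
    (223.695, 2519.31, 657.62),
    (57.448, 1241.78, 190.63),
    (187.618, 1473.44, 375.91),
    (88.456, 1171.73, 259.18),
]
n_t, k = 12, len(sums)
n = n_t * k

def gen_var(sxx, syy, sxy, divisor):
    """Generalized variance |v_ij| of a bivariate sample with the given divisor."""
    return (sxx * syy - sxy ** 2) / divisor ** 2

within = [gen_var(*row, n_t) for row in sums]             # |v_ijt|
pooled = gen_var(*(sum(col) for col in zip(*sums)), n)    # |v_ija|

# With equal n_t, log L1 = (1/k) * sum of log|v_ijt| - log|v_ija|.
log_L1 = sum(math.log(v) for v in within) / k - math.log(pooled)
L1 = math.exp(log_L1)
chi2 = -n * log_L1        # referred to chi-square with 3(k - 1) d.f.
df = 3 * (k - 1)
print(round(L1, 4), round(chi2, 2), df)
```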


We therefore accept the equality of dispersion and proceed
to test the means by hypothesis $H_2$. We now have a generalized
analysis of variance.

                    D.F.         S.S. (x)                S.S. (y)                S.P. (x,y)               Generalized variance
Between samples   k-1 = 4        306.089                 662.77                  214.86                   $|v_{ijm}|$ = 43.528
Within samples    n-k = 55       636.165                 7653.42                 1697.52                  $|v_{ija}|$ = 552.018
Totals            n-1 = 59       $n v_{110}$ = 942.254     $n v_{220}$ = 8316.19     $n v_{120}$ = 1912.38      $|v_{ij0}|$ = 1180.77

$$L_2 = |v_{ija}| \,/\, |v_{ij0}| = 0.4750$$

We can apply an approximate test to $-n \log L_2$ in the
same way. The quantity is 44.59 with $2(5 - 1) = 8$ d.f. This is
highly significant. In this particular case an exact test
is available, the ratio

$$\frac{1 - \sqrt{L_2}}{\sqrt{L_2}} \cdot \frac{n - k - 1}{k - 1}$$

being distributed as an F ratio with $2(k - 1)$ and $2(n - k - 1)$
degrees of freedom. This also rejects the hypothesis, the
ratio being 4.95 with 8 and 108 d.f. (We shall discuss
later the cases in which exact F-ratio tests are available.)

We conclude that there is heterogeneity in the mean
values. We therefore proceed to test x and y separately:

                   Estimates of variance      d.f.
                      x            y
Between samples     76.522       165.69         4
Within samples      11.566       139.15        55

A simple F ratio test shows that at the 1% point the
differences between mean strengths are significant, but not those
between hardness.


The conclusion is that although the samples are under
control as regards dispersions and hardness they are not under
control as regards mean strength, although hardness and
strength are fairly highly correlated. One would, I think,
be led to consider the accuracy and nature of the methods of
measurement before drawing conclusions about the materials
themselves.


Example 8.2 
(Pearson and Wilks, 1933) 


Measurements were available on the lengths and breadths
of 600 skulls, 20 from each of 30 races. (The details are
given in the paper under reference.) That there would be
some variation between races was to be expected but it was of
interest to see whether it extended to dispersions. The
hypothesis $H_1$ was tested. The authors found

$|v_{ija}| = 656.360$,     $\frac{1}{k}\sum_t \log_{10}|v_{ijt}| = 2.644429$
                           $\log_{10}|v_{ija}| = 2.817148$
                           Difference = $\bar{1}.827281$
                           $L_1 = 0.6719$

The simple test gives 600 (.39765) = 236.8 distributed as $\chi^2$
with 150 - 63 = 87 d.f. This is highly significant.


There is therefore lack of uniformity in the dispersions. 
We now proceed to consider them individually. 


The homogeneity of a set of variances in the univariate
case can be tested by known methods, which are, in fact,
simpler forms of the likelihood criteria we are now employing
here. It is found that the variances of both x and y are
significantly heterogeneous. Finally we consider the
correlations r. A convenient test of homogeneity here is obtained
by transforming the 30 observed r's by the formula $z = \tanh^{-1} r$
and testing

$$\chi^2 = \sum_{t=1}^{30} (n_t - 3)(z_t - \bar{z})^2 \qquad (8.25)$$

with 29 degrees of freedom. In the present instance this led
to $\chi^2 = 96.01$ which is very significant.
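
The test (8.25) is simple to program. The sketch below is an illustration only: the skull correlations are not reproduced in these notes, so it is applied instead to the five correlations of Example 8.1 (a calculation the text does not report), and it assumes that $\bar{z}$ is the mean of the $z_t$ weighted by $n_t - 3$. The function name is hypothetical.

```python
import math

def z_homogeneity(rs, ns):
    """Chi-square for homogeneity of correlations via z = artanh(r).
    Returns the statistic and its degrees of freedom, len(rs) - 1."""
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    # Assumption: z-bar is the mean of the z's weighted by (n_t - 3).
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    chi2 = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return chi2, len(rs) - 1

# Correlations of the five die-casting samples (Example 8.1), n_t = 12.
chi2, df = z_homogeneity([0.683, 0.876, 0.714, 0.715, 0.805], [12] * 5)
print(round(chi2, 2), df)
```

For these five samples the statistic is small relative to its 4 degrees of freedom, which is in keeping with the acceptance of $H_1$ in Example 8.1.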


The general conclusion is that there exists racial
heterogeneity in variances and correlation, the latter being apparently
less uniform than the variances themselves.

In a case of this kind we should probably not wish to
proceed to test heterogeneity of means; but if we did so wish, we
should proceed as follows: since $H_1$ is untenable we should not
test $H_2$. We revert to tests of type $H_2$ on the variates x and
y separately. We can then proceed on the basis of univariate
theory, by the Behrens-Fisher test, or by assuming that
heterogeneity is not important enough to invalidate an ordinary
F-ratio test.


8.6 Methods of a parallel kind could be followed for the testing
of k samples of p-variate populations, although I am not
aware that the general case has been worked out explicitly.
Wilks (1946) has, however, considered a problem of a very
similar character for the p-variate case. We are given a
sample of n from one p-variate population with means $\mu_i$ and
dispersions $\rho_{ij}\sigma_i\sigma_j$. We consider the hypotheses

$H$: that the means are all equal and the corresponding
dispersions all equal for each variate;

$H_1$: that the dispersions are equal regardless of the
means;

$H_2$: that, given the equality of dispersions, the means are
all equal.

Let
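$$\bar{x}_i = \frac{1}{n}\sum_{\alpha=1}^{n} x_{i\alpha}, \qquad (8.26)$$

$$\bar{x} = \frac{1}{p}\sum_{i=1}^{p} \bar{x}_i, \qquad (8.27)$$

$$s_{ij} = \frac{1}{n}\sum_{\alpha=1}^{n} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j), \qquad (8.28)$$

$$s^2 = \frac{1}{p}\sum_{i=1}^{p} s_{ii}, \qquad (8.29)$$

$$r = \frac{1}{p(p-1)\,s^2}\sum_{i \neq j} s_{ij}. \qquad (8.30)$$

The hypothesis $H$ to be tested is

$$\mu_i = \mu, \qquad (8.31)$$

$$(\rho_{ij}\sigma_i\sigma_j) = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & & & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{pmatrix}, \qquad (8.32)$$

and the likelihood is

$$P = \frac{|\alpha^{ij}|^{n/2}}{(2\pi)^{np/2}} \exp\left\{-\tfrac{1}{2}\sum_{\alpha=1}^{n}\sum_{i,j=1}^{p} \alpha^{ij}(x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)\right\}, \qquad (8.33)$$

where $(\alpha^{ij})$ is inverse to $(\rho_{ij}\sigma_i\sigma_j)$.
We maximize P for unconstrained values of the parameters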
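to find

$$\hat\mu_i = \bar{x}_i, \qquad (8.34)$$

$$\hat\alpha^{ij} = s^{ij}, \qquad (8.35)$$

and then, on substitution in (8.33) obtain

$$P(\Omega) = \frac{e^{-np/2}}{|s_{ij}|^{n/2}\,(2\pi)^{np/2}}. \qquad (8.36)$$

We now invert (8.32) to find, on $H$,

$$(\alpha^{ij}) = \begin{pmatrix} A & B & \cdots & B \\ B & A & \cdots & B \\ \vdots & & & \vdots \\ B & B & \cdots & A \end{pmatrix}, \qquad (8.37)$$

where

$$A = \frac{1 + (p-2)\rho}{\sigma^2 (1-\rho)\{1 + (p-1)\rho\}}, \qquad (8.38)$$

$$B = \frac{-\rho}{\sigma^2 (1-\rho)\{1 + (p-1)\rho\}}. \qquad (8.39)$$

This gives us, for the likelihood, on substitution in (8.33),

$$P = \frac{[(A-B)^{p-1}\{A + (p-1)B\}]^{n/2}}{(2\pi)^{np/2}} \exp\left\{-\tfrac{1}{2}\Bigl[A\sum_{\alpha}\sum_{i} (x_{i\alpha} - \mu)^2 + B\sum_{\alpha}\sum_{i \neq j} (x_{i\alpha} - \mu)(x_{j\alpha} - \mu)\Bigr]\right\}. \qquad (8.40)$$

We now find the maximum likelihood estimators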


$$\hat\mu = \bar{x}, \qquad (8.41)$$

$$\hat{A} = \frac{1 + (p-2)r_0}{s_0^2 (1-r_0)\{1 + (p-1)r_0\}}, \qquad (8.42)$$

$$\hat{B} = \frac{-r_0}{s_0^2 (1-r_0)\{1 + (p-1)r_0\}}, \qquad (8.43)$$

where

$$s_{ij0} = s_{ij} + (\bar{x}_i - \bar{x})(\bar{x}_j - \bar{x}), \qquad (8.44)$$

$$s_0^2 = \frac{1}{p}\sum_{i=1}^{p} s_{ii0}, \qquad (8.45)$$

$$r_0 = \frac{1}{p(p-1)\,s_0^2}\sum_{i \neq j} s_{ij0}. \qquad (8.46)$$

On substitution in (8.40) we then find

$$P(\omega) = \frac{e^{-np/2}}{(2\pi)^{np/2}\,[(s_0^2)^p (1-r_0)^{p-1}\{1 + (p-1)r_0\}]^{n/2}}. \qquad (8.47)$$

We take as our ratio $L$

$$L = \lambda^{2/n} = \frac{|s_{ij}|}{(s_0^2)^p (1-r_0)^{p-1}\{1 + (p-1)r_0\}}. \qquad (8.48)$$

The denominator is what we get by inserting $r = r_0$, $s = s_0$ in
the matrix of sample dispersions, so that the ratio is, in
fact, a Wilks' ratio of the usual type.


8.7 As before, we require the distribution of the criterion
to make a test, and it is arrived at in the usual way. The
approximate test states that $-n \log L$ is distributed as $\chi^2$
with $p(p-1)/2 + p + p - 3 = p(p+3)/2 - 3$ degrees of
freedom. The corresponding criteria for the other hypotheses
are

$$L_1 = \frac{|s_{ij}|}{(s^2)^p (1-r)^{p-1}\{1 + (p-1)r\}}, \qquad (8.49)$$

$$L_2 = \frac{s^2 (1-r)}{s_0^2 (1-r_0)}. \qquad (8.50)$$

On the approximate test $-n \log L_1$ is distributed as $\chi^2$ with
$\tfrac{1}{2}p(p+1) - 2$ d.f. and $-n(p-1) \log L_2$ as $\chi^2$ with $p - 1$
d.f. The reason for the factor $p - 1$ in the latter case is
that the criterion $L_2$ is the $\tfrac{1}{2}n(p-1)$th root of the
likelihood ratio.


Example 8.3 
In this example (Wilks, 1946) certain details not relevant 
to the method have been omitted. 


A test involving 60 items was given to 100 subjects. The 
test items were divided into three groups of 20 on the basis 
of external criteria and a score in each of the three groups 
obtained for each individual. We may then regard the scores 
as 100 observations on a trivariate normal population (p = 3, 
n = 100). The following values were found : 


$\bar{x}_1 = 10.9900$     $s_{11} = 16.8451$     $s_{12} = 13.5493$
$\bar{x}_2 = 10.9300$     $s_{22} = 18.1099$     $s_{13} = 14.5826$
$\bar{x}_3 = 11.2600$     $s_{33} = 17.7134$     $s_{23} = 13.8056$

$s^2 = 17.5558$     $s_0^2 = 17.5764$
$r = 0.7963$       $r_0 = 0.7943$
$|s_{ij}| = 545.5308$.


The data look homogeneous, that is to say, it appears that 
the three sub-tests of the original test of 60 items are 
"parallel". We proceed to examine this hypothesis. 


Consider hypothesis $H$. The criterion of (8.48) becomes
0.9209. The 5% point of $-100 \log L$ for 6 d.f. is 12.5912,
giving for the 5% point of $L$ itself 0.8817. The observed
value is considerably greater than this and we accept the
hypothesis.

Had it been otherwise we might have gone on to test $H_1$.
The criterion of (8.49) becomes 0.93870. The 5% point of $-100
\log L_1$ for 4 d.f. is 9.4877 giving a value of 0.9095 as the 5%
point of the criterion itself. The observed value is not
significant.

Finally, for the criterion (8.50) we find 0.9914. A
test is hardly necessary but if we carry it out the result is
not significant.


We conclude that the three sub-tests are similar in 
respect of means and dispersions. 
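
The three criteria can be recomputed from the summary values of this example. The following sketch assumes NumPy is available and assigns the three printed covariances to the pairs (1,2), (1,3) and (2,3) in that order; the helper name `compound` is hypothetical. Because the inputs are rounded, the results agree with the quoted values of $L$ and $L_2$ only to about three decimal places.

```python
import numpy as np

p, n = 3, 100
means = np.array([10.9900, 10.9300, 11.2600])
S = np.array([[16.8451, 13.5493, 14.5826],
              [13.5493, 18.1099, 13.8056],
              [14.5826, 13.8056, 17.7134]])

def compound(M):
    """Mean variance s^2 and mean correlation r of a dispersion matrix."""
    m = M.shape[0]
    s2 = np.trace(M) / m
    r = (M.sum() - np.trace(M)) / (m * (m - 1) * s2)
    return s2, r

dev = means - means.mean()
S0 = S + np.outer(dev, dev)            # dispersions about the common mean, (8.44)
s2, r = compound(S)
s2_0, r_0 = compound(S0)

det_S = np.linalg.det(S)
L = det_S / (s2_0 ** p * (1 - r_0) ** (p - 1) * (1 + (p - 1) * r_0))   # (8.48)
L1 = det_S / (s2 ** p * (1 - r) ** (p - 1) * (1 + (p - 1) * r))        # (8.49)
L2 = s2 * (1 - r) / (s2_0 * (1 - r_0))                                  # (8.50)
print(round(L, 4), round(L1, 4), round(L2, 4))
```

The three criteria are linked by $L = L_1 L_2^{\,p-1}$, which gives a convenient check on the arithmetic.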


The analysis of dispersion 


8.8 In ordinary variance analysis, tests based on the F-ratio
presuppose that error variances are equal. This corresponds
to the hypothesis which we have called $H_2$, and that hypothesis
is accordingly the natural extension to multivariate analysis
of the univariate analysis of variance. We must here sound
a warning about a point which has already occurred in Example
8.2. In variance analysis we often assume equality of
underlying variance, sometimes without realizing it, and disaster
is apparently averted by the fact that the tests are not very
sensitive to small departures from equality. A parallel
assumption that p-variate dispersions are equal is more
hazardous. The point has never, so far as I know, been
decisively investigated, but the indications are that tests
may be somewhat sensitive to differences between parent
covariances, though not perhaps to differences between parent
variances. To leap at once to hypothesis $H_2$ before testing
$H$ or $H_1$ is dangerous unless we have prior knowledge about the
parental dispersions.


8.9 The analysis of quadratic sums on the basis of hypothesis
$H_2$ is sometimes known as the "multivariate analysis of
variance". This is a rather clumsy expression which it may be
better to avoid. I shall call it the analysis of dispersion
and following Rao (1948a) shall draw a useful distinction
between the analysis of dispersion and the analysis of covariance.
In the former we are concerned with the effect of classification
on a set of interdependent variates, as illustrated in
Examples 8.1 to 8.3. In the latter we also have the effect
of classification on a set of variates but we are interested
primarily in the effect on one of them. The others appear
as disturbing influences and are removed from the principal
variate by regression techniques. The distinction is much the
same as was drawn earlier between interdependence and
dependence. In the general case we may have a complex of p + q
variates and wish to study the first p after the effect of
the other q has been removed. The removal of the concomitant
variation of the q variates is the function of covariance
analysis. After it has been performed we may carry out a
dispersion analysis on the adjusted p variates, this reducing
to a variance analysis if p = 1.


8.10 Consider now hypotheses of type $H_2$ for the p-variate
population. If we have k classes the analysis of dispersion may
be put in the form

                    d.f.       Matrix
Between classes     k - 1      $(s_{ijm})$, say
Within classes      n - k      $(s_{ija})$, say
Total               n - 1      $(s_{ij0})$, say.        (8.51)

An analysis on the lines of the foregoing leads to a criterion
$\lambda_{H_2}$ of which the $2/n$th power is

$$W = \frac{|s_{ija}|}{|s_{ij0}|} \qquad (8.52)$$

as in (8.21). This may be tested by taking $-n \log W$ as $\chi^2$
with $p(k - 1)$ degrees of freedom. In fact, on $H_2$ the
dispersions are common to both $\Omega$ and $\omega$; for the former there are
$pk$ means and for the latter $p$ means; the number of d.f. is
thus $pk - p$.


A similar situation arises if, instead of estimating means
we estimate linear functions of them. This result concerning
the so-called linear hypothesis enables us to apply (8.51) and
the associated test whenever there is a matrix of q (say)
degrees of freedom split off from the total matrix and
independent of the residual.


8.11 A refinement in the significance test has been proposed
by Bartlett, namely that

$$-\{\nu - \tfrac{1}{2}(p + k)\} \log W \qquad (8.53)$$

should be taken as $\chi^2$ with $p(k - 1)$ d.f. Here $\nu$ is the number
of degrees of freedom in the total dispersion, our $n - 1$.


8.12 For certain small values of p and k there are exact tests
in the F-distribution. They are as follows

                    Variance ratio                                              d.f.
k = 2, any p        $\frac{1 - W}{W} \cdot \frac{n - p - 1}{p}$                     $p$ and $n - p - 1$
k = 3, any p        $\frac{1 - \sqrt{W}}{\sqrt{W}} \cdot \frac{n - p - 2}{p}$       $2p$ and $2(n - p - 2)$
p = 1, any k        $\frac{1 - W}{W} \cdot \frac{n - k}{k - 1}$                     $k - 1$ and $n - k$
p = 2, any k        $\frac{1 - \sqrt{W}}{\sqrt{W}} \cdot \frac{n - k - 1}{k - 1}$   $2(k - 1)$ and $2(n - k - 1)$

Box (1949, Biometrika, 36, 317) has reviewed and
re-examined the distribution theory of criteria of the
likelihood-ratio and has obtained better approximations in certain
cases. Box's paper contains a number of references to earlier
work.


8.13 More complicated situations involving manifold classification
and regression can be dealt with by the same technique,
provided that we can conduct the analysis so as to emerge with
two dispersion determinants independently distributed in
Wishart's form; the ratio of one to their sum then follows the
W distribution.


Example 8.4. 
(Bartlett, 1947 a) 


In an experiment to examine the effect of fertilizers on
grain 8 treatments were applied in each of 8 randomized blocks;
and on each plot two observations were made, the yield of
straw ($x_1$) and the yield of grain ($x_2$). The following was
obtained:

               d.f.    S.S. ($x_1^2$)    S.P. ($x_1 x_2$)    S.S. ($x_2^2$)
Blocks           7       86,045.8        56,073.6         75,841.5
Treatments       7       12,496.8        -6,786.6         32,985.0
Residual        49      136,972.6        58,549.0         71,496.1
Total           63      235,515.2       107,836.0        180,322.6

We are not interested in block differences and extract
them from the variation. This gives us the middle two lines
of the table and a new total:

Total
(excluding
blocks)         56      149,469.4        51,762.4        104,481.1

Thus

$$W = \frac{\begin{vmatrix} 136{,}972.6 & 58{,}549.0 \\ 58{,}549.0 & 71{,}496.1 \end{vmatrix}}{\begin{vmatrix} 149{,}469.4 & 51{,}762.4 \\ 51{,}762.4 & 104{,}481.1 \end{vmatrix}} = 0.4920.$$

The quantity $-\{56 - \tfrac{1}{2}(8 + 2)\} \log W = -51 \log 0.4920 = 36.2$
is distributed approximately as $\chi^2$ with $2 \times 7 = 14$ d.f. It is
significant, being about on the 0.1% level.
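
The value of W and of Bartlett's statistic can be checked directly from the residual and treatment lines of the table. The sketch below is a minimal illustration using only the figures printed above; the helper name `det2` is hypothetical.

```python
import math

def det2(a11, a12, a22):
    """Determinant of a symmetric 2 x 2 matrix of sums of squares and products."""
    return a11 * a22 - a12 ** 2

# (S.S. x1, S.P. x1x2, S.S. x2) from the table of Example 8.4.
residual = (136972.6, 58549.0, 71496.1)
treatments = (12496.8, -6786.6, 32985.0)
total = tuple(r + t for r, t in zip(residual, treatments))   # blocks excluded

W = det2(*residual) / det2(*total)

p, k = 2, 8          # two variates, eight treatments
nu = 49 + 7          # d.f. of the total excluding blocks
stat = -(nu - 0.5 * (p + k)) * math.log(W)   # Bartlett's form (8.53)
df = p * (k - 1)
print(round(W, 4), round(stat, 1), df)
```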


We can proceed to further analysis in several ways. First,
as in Example 8.1, we can examine the significance of the variates
separately by an ordinary F-ratio. This gives

               Estimated mean squares      d.f.
                  $x_1$         $x_2$
Treatments       1785        4712            7
Residual         2795        1459           49

The first ratio, 1.565 for 49 and 7 d.f., is fortunately
not significant. (We should have been hard put to it to
explain a significance if it had appeared, for the residual mean
square is larger than the treatment mean square.) The second
ratio is 3.230 for 7 and 49 d.f. which is significant at the
1% point. At first sight it would seem that the treatments
are affecting grain yield but not straw yield. The correlation
"between treatments" is about -0.3 and that "between
residuals" is about +0.6, but the former is not significant.


We could also perform a covariance analysis to see if the
treatment on straw was influenced by (and perhaps masked by) the
concomitant variate grain. Our table is

               d.f.    S.S. ($x_1^2$)    S.P. ($x_1 x_2$)    S.S. ($x_2^2$)
Treatment        7       12,496.8        -6,786.6         32,985.0
Residual        49      136,972.6        58,549.0         71,496.1
Total           56      149,469.4        51,762.4        104,481.1

Considering the regression of $x_1$ on $x_2$ we have for the
coefficient of the regression equation (calculated from the residual
items)

$$\frac{S(x_1 x_2)}{S(x_2^2)} = \frac{58{,}549.0}{71{,}496.1} = 0.818{,}911{,}8 = b, \text{ say}.$$

The residual total S.S. ($x_1$) is then

$$136{,}972.6 - 58{,}549.0\,b = 89{,}026.1.$$

Similarly for the "total" sums of squares and products

$$b' = \frac{51{,}762.4}{104{,}481.1} = 0.495{,}423{,}6$$

and the residual sum of squares is

$$149{,}469.4 - 51{,}762.4\,b' = 123{,}825.1.$$

We then construct the table for residual $x_1$ effects after the
extraction of $x_2$, obtaining the treatment line by subtraction:

               d.f.      S.S.         Mean square
Treatment        7      34,799.0         4971
Residual        48      89,026.1         1855
Total           55     123,825.1         2251

We really require to test the ratio of the residual to the
total but as this is a univariate test we can equally well
test 4971 / 1855 = 2.68 in the F-distribution with 7 and
48 d.f. This is significant at the 5% point. We should
infer that the data cannot be wholly explained as an effect on
$x_2$, the grain yield. There appears also an effect on the
straw yield which is obscured if we consider that yield by
itself, owing to the correlations between the variables.
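
The covariance adjustment can be retraced in a few lines. This is a minimal sketch using the figures of the table above; the function name `adjusted_ss` is a hypothetical choice.

```python
def adjusted_ss(s11, s12, s22):
    """Residual S.S. of x1 after linear regression on x2, and the coefficient."""
    b = s12 / s22
    return s11 - b * s12, b

# Residual line and total line (blocks excluded) of Example 8.4.
res_ss, b = adjusted_ss(136972.6, 58549.0, 71496.1)
tot_ss, b_dash = adjusted_ss(149469.4, 51762.4, 104481.1)

# The treatment line is obtained by subtraction; one d.f. is lost from
# the residual for the fitted regression coefficient (49 - 1 = 48).
treat_ss = tot_ss - res_ss
F = (treat_ss / 7) / (res_ss / 48)
print(round(b, 7), round(b_dash, 7), round(res_ss, 1), round(tot_ss, 1), round(F, 2))
```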


It may be as well to explain the basis of this covariance
analysis. If the $j$th observation on the $i$th treatment is
$y_{ij}$ our model is

$$y_{ij} = \tau_i + \beta x_{ij},$$

where, for simplicity, $y$ relates to the $x_1$ variable and $x$ to
the $x_2$ variable. Our hypothesis is not that $\beta = 0$ but that
each $\tau_i = 0$, namely that the $y$'s are homogeneous apart from
the concomitant $x$'s.

On the hypothesis that $\tau_i = 0$ the estimate of $\beta$ is given by
$S(yx) / S(x^2)$ over the whole sample. This we called $b'$ in our
present example and the corresponding residual $S(y^2) - b'S(yx)$
we may call $R_0$.

If $\tau_i$ is not zero, the equations of estimate are

$$\sum_j (y_{ij} - t_i - b\,x_{ij}) = 0,$$

where $t_i$ is the estimator of $\tau_i$, giving

$$t_i = \bar{y}_{i.} - b\,\bar{x}_{i.},$$

where $\bar{y}_{i.}$ is the mean of $y_{ij}$. A second equation of estimate
is

$$S(x_{ij}\,y_{ij}) - S(t_i\,x_{ij}) - b\,S(x^2) = 0,$$

and the residual is

$$S(y^2) - S(t_i\,y_{ij}) - b\,S(yx)$$
$$= S(y^2) - S\{(\bar{y}_{i.} - b\,\bar{x}_{i.})\,y_{ij}\} - b\,S(yx)$$
$$= S(y_{ij} - \bar{y}_{i.})^2 - b\,S\{y_{ij}(x_{ij} - \bar{x}_{i.})\}$$
$$= S(y_{ij} - \bar{y}_{i.})^2 - b\,S\{(y_{ij} - \bar{y}_{i.})(x_{ij} - \bar{x}_{i.})\},$$

since

$$S\{\bar{y}_{i.}(x_{ij} - \bar{x}_{i.})\} = 0.$$

This residual, which we may call $R_1$, is the "within-class" sum
as calculated in the example. It is the ratio $R_1 / R_0$ which
we have, in effect, tested.


Example 8.5 


The case of the Egyptian skulls. (Barnard, 1935, Ann. 
Eugen. Lond., 6, 352; Kendall, Advanced Theory, vol.2, Chapter 
17; Bartlett, 1947a; Rao, 1948b.) 


Some data by Miss Barnard have been examined and re-examined
by several writers on multivariate analysis and we shall
discuss them again here.

Barnard had four series of skulls, 91 from late Predynastic,
162 from the Sixth-to-Twelfth, 70 from the Twelfth-and-Thirteenth
and 75 from Ptolemaic dynasties. On each four
measurements were taken: $x_1$ = maximum breadth, $x_2$ = basialveolar
length, $x_3$ = nasal height, $x_4$ = basibregmatic height.
The main problems were (a) to construct a discriminant function
and (b) to discuss the possible changes in the skull-conformation
over time.

(As an additional source of confusion, note that Kendall
in earlier editions misdescribed the measurements, Bartlett
did so but corrected his 1947a paper before final printing, and
Rao, 1948b and in his book, has the variates in the wrong order
with a misprint in the mean of $x_2$ in group II.)


The means of the series as given by Miss Barnard are


              I              II             III            IV
           n1 = 91        n2 = 162       n3 = 70        n4 = 75

x1      133.582,418    134.265,432    134.371,429    135.306,667
x2       98.307,692     96.462,963     95.857,143     95.040,000
x3       50.835,165     51.148,148     50.100,000     52.093,333
x4      133.000,000    134.882,716    133.642,857    131.466,667


She gives the following sums of squares and products within
series (394 d.f.). The column numbers refer to variates.


           1               2               3               4

1     9661.997,470     445.573,301    1130.623,900    2148.584,219
2                     9073.115,207    1239.221,990    2255.812,722
3                                     3938.320,351    1271.054,662
4                                                     8741.508,829
                                                            (8.54)


The matrix for the whole observations (397 d.f.) given by
Bartlett (1947), and obtained by adding back to (8.54) the
matrix between series, is:


           1               2               3               4

1     9785.178,098     214.197,666    1217.929,248    2019.820,216
2                     9559.460,890    1131.716,372    2381.126,040
3                                     4088.731,856    1133.473,898
4                                                     9382.242,720
                                                            (8.55)


Finally, the matrix between classes (3 d.f.) is

           1               2               3               4

1      123.180,628    -231.375,635      87.305,348    -128.768,904
2                      486.345,863    -107.505,618     125.313,318
3                                      100.411,505    -137.580,764
4                                                      640.733,891
                                                            (8.56)

It is as well to look over these results before proceeding.
We note from (8.56) that $x_1$ and $x_3$ are positively correlated
between classes, as are $x_2$ and $x_4$; but that either member of
the first pair is negatively correlated with any member of
the second pair.


We may, first of all, consider whether there are any
significant differences between series on an over-all test.
The appropriate criterion is the ratio of the determinants of
(8.54) and (8.55), namely
\[ L = \frac{2426.898}{2954.474} = 0.8214. \]
$-393 \log L = 77.3$ and the number of degrees of freedom is
$3 \times 4 = 12$. This value is highly significant and the
differences between series we may therefore take as real.
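
The arithmetic of this criterion is easily checked. The sketch below is an illustration only (the function name is ours, not the text's): it forms $-m \log L$ from a determinant ratio L and a multiplier m, using natural logarithms, and reproduces the figure quoted above. The comparison with the $\chi^2$ distribution on the stated degrees of freedom is left to tables.

```python
import math

def minus_m_log_l(l_ratio, multiplier):
    """Return -m log L, the approximate chi-square criterion (natural logs)."""
    return -multiplier * math.log(l_ratio)

# Over-all test of differences between series: determinants of (8.54) and (8.55).
L = 2426.898 / 2954.474
stat = minus_m_log_l(L, 393)   # to be referred to chi-square with 12 d.f.
```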
We might now ask whether $x_3$ and $x_4$ contribute to this
difference independently of the concomitant variation due to
$x_1$ and $x_2$. This involves extracting the regressions of $x_3$
and $x_4$ on $x_1$ and $x_2$ from the former and testing the residual
matrices.


If y and x are vectors related by
\[ y = \alpha x + \epsilon, \]
the regression $\alpha$ is estimated by
\[ E(yx') \{E(xx')\}^{-1} \]
and the variation due to regression is
\[ 394\, E(yx') \{E(xx')\}^{-1} E(xy'). \]
In our present case, from (8.54),
\[ E(xx') = \frac{1}{394}
   \begin{pmatrix} 9661.997{,}470 & 445.573{,}301 \\ 445.573{,}301 & 9073.115{,}207 \end{pmatrix}, \]
the inverse of which is
\[ 394 \times 10^{-4} \times
   \begin{pmatrix} 1.037{,}332 & -0.050{,}942 \\ -0.050{,}942 & 1.104{,}659 \end{pmatrix}. \]

The variation due to regression (factors in 394 cancelling) is
then
\[ 10^{-4}
   \begin{pmatrix} 1130.623{,}900 & 1239.221{,}990 \\ 2148.584{,}210 & 2255.812{,}722 \end{pmatrix}
   \begin{pmatrix} 1.037{,}332 & -0.050{,}942 \\ -0.050{,}942 & 1.104{,}659 \end{pmatrix}
   \begin{pmatrix} 1130.623{,}900 & 2148.584{,}210 \\ 1239.221{,}990 & 2255.812{,}722 \end{pmatrix}
   = \begin{pmatrix} 287.967{,}620 & 534.238{,}796 \\ 534.238{,}796 & 991.621{,}041 \end{pmatrix}. \tag{8.57} \]
Subtracting this from the matrix of $x_3$ and $x_4$, viz.
\[ \begin{pmatrix} 3938.320{,}351 & 1271.054{,}662 \\ 1271.054{,}662 & 8741.508{,}829 \end{pmatrix}, \]
we get the residual
\[ \begin{pmatrix} 3650.353{,}731 & 736.815{,}866 \\ 736.815{,}866 & 7749.887{,}788 \end{pmatrix} \tag{8.58} \]
with 394 - 2 = 392 d.f.
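
The extraction of a regression from a matrix of sums of squares and products can be sketched as follows. This is an illustrative Python/NumPy fragment, not part of the original notes; the function name is hypothetical, and the figures are those quoted in the working above (with 3938.320,351 taken for the sum of squares of $x_3$, as in that working).

```python
import numpy as np

def residual_after_regression(s_yy, s_yx, s_xx):
    """Return S_yy - S_yx S_xx^{-1} S_xy for sums of squares and products."""
    s_yy = np.asarray(s_yy, dtype=float)
    s_yx = np.asarray(s_yx, dtype=float)
    s_xx = np.asarray(s_xx, dtype=float)
    # solve() avoids forming the inverse of S_xx explicitly.
    return s_yy - s_yx @ np.linalg.solve(s_xx, s_yx.T)

# Within-series figures: x = (x1, x2), y = (x3, x4).
s_xx = [[9661.997470, 445.573301], [445.573301, 9073.115207]]
s_yx = [[1130.623900, 1239.221990], [2148.584210, 2255.812722]]
s_yy = [[3938.320351, 1271.054662], [1271.054662, 8741.508829]]

res = residual_after_regression(s_yy, s_yx, s_xx)
det_res = np.linalg.det(res)   # about 0.2775 x 10^8
```

Because the regression is removed with an exact solution rather than the rounded inverse, the entries agree with (8.58) only to about the third decimal place.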


Similarly, operating on (8.55) for the total dispersions
we find the residual
\[ \begin{pmatrix} 3809.335{,}190 & 611.698{,}381 \\ 611.698{,}381 & 8393.755{,}848 \end{pmatrix} \tag{8.59} \]
with 397 - 2 = 395 d.f.


The ratio of the determinants of (8.58) to (8.59) is
\[ L = \frac{0.277{,}469}{0.316{,}008} = 0.8781. \]
$-392 \log L = 51.39$ and the appropriate number of degrees of
freedom is $3 \times 2 = 6$. The value is highly significant and
we conclude that $x_3$ and $x_4$ are "relevant" variates, in that
the differences between series cannot be assigned to $x_1$ and $x_2$
alone. Another way of looking at this will emerge when we
consider discriminant functions.


One of the topics discussed by Miss Barnard was the use
of these measurements in discriminating between the four
periods and the possibility of the variates having a linear
regression on time (that is to say, a linear trend which might,
perhaps, be different for the four variates). The intervals
between the four series were taken in the proportion 2, 1, 2
and we may conveniently take the values of t as -5, -1, +1,
+5. On this basis the sums of products of $x_1$, $x_2$, $x_3$, $x_4$
with time $t - \bar t$ are


718.762,86, -1407.260,75, 410.101,94, -733.427,58

and $\sum (t - \bar t)^2 = 4307.668,32$. ($\bar t$ is not zero because the numbers
in classes are unequal, and its value is -0.432,161.)

We are now examining, not the regression of some variates
on others, but the regression of all on the extraneous variable,
time. The sums of squares and products due to regression
(1 d.f.) are


           1               2               3               4

1      119.930,358    -234.810,812      68.428,235    -122.377,258
2                      459.734,449    -133.975,163    -149.601,596
3                                       39.042,852     -69.824,358
4                                                      124.874,099
                                                            (8.60)

Here, for example, the item in row 1 and column 1 is
\[ \frac{\{\sum x_1 (t - \bar t)\}^2}{\sum (t - \bar t)^2}
   = \frac{(718.762{,}86)^2}{4307.668{,}32} = 119.930{,}358, \]
and that in row 1 and column 2 is
\[ \frac{\sum x_1 (t - \bar t) \, \sum x_2 (t - \bar t)}{\sum (t - \bar t)^2}
   = \frac{718.762{,}86 \times (-1407.260{,}75)}{4307.668{,}32} = -234.810{,}812. \]


The residual after removing the regression on time from the
original matrix is given by subtracting (8.60) from (8.55),
giving (396 d.f.)


           1               2               3               4

1     9665.247,740     449.008,478    1149.501,013    2142.197,474
2                     9099.726,441    1265.691,535    2231.524,444
3                                     4049.689,004    1203.298,256
4                                                     9257.368,621
                                                            (8.61)
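
The construction of the regression matrix on a single extraneous variable may be sketched as below. The Python/NumPy fragment is illustrative only and its names are hypothetical. It recomputes $\bar t$ and $\sum (t - \bar t)^2$ from the class numbers, and forms the matrix of rank one from the sums of products quoted in the text; the entry for $x_3$ is taken as positive, an assumption made so that the signs agree with row 3 of (8.60).

```python
import numpy as np

def regression_ssp(sums_of_products, sum_sq_t):
    """Rank-one matrix of sums of squares and products due to regression on t."""
    v = np.asarray(sums_of_products, dtype=float)
    return np.outer(v, v) / sum_sq_t

n = np.array([91, 162, 70, 75])          # numbers in the four series
t = np.array([-5.0, -1.0, 1.0, 5.0])     # time values assigned to the series
t_bar = (n @ t) / n.sum()
sum_sq_t = n @ (t - t_bar) ** 2

# Sums of products of x1..x4 with (t - t_bar), as quoted in the text
# (sign of the x3 entry assumed positive).
sums_xt = [718.76286, -1407.26075, 410.10194, -733.42758]
reg = regression_ssp(sums_xt, 4307.66832)
```

Only the entries that follow directly from the formula are checked here; the matrix (8.60) as printed is not otherwise verified.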


We now consider whether this residual is homogeneous,
testing the variation within series as given by (8.54), with 394
d.f., against (8.61). The criterion is
\[ L = \frac{10^{16} \times 0.242{,}690}{10^{16} \times 0.268{,}738} = 0.9031. \]
The multiplier of $-\log L$ is $396 - \tfrac12(2+4+1) = 392\tfrac12$, and
the d.f. number is $2 \times 4 = 8$. $-392\tfrac12 \log 0.9031 = 40.02$, which
is significant.


We conclude that either (a) the regression on time is not
linear or (b) that there are additional sources of variation
between series which are not temporal effects.

8.14 In conclusion reference may be made to the work of Roy on
the comparison of two dispersion matrices of the same order p.
If the matrices are $S_1$ and $S_2$ the roots of
\[ |S_1 - \lambda S_2| = 0, \tag{8.62} \]
which, for non-degenerate cases, are those of
\[ |S_1 S_2^{-1} - \lambda I| = 0, \tag{8.63} \]
will be near unity if the matrices emanated from the same
populations. Roy proposes to reject the hypothesis of equality
if, when the roots are arranged in ascending order
\[ 0 \le \lambda_1 \le \lambda_2 \le \dots \le \lambda_p, \tag{8.64} \]
$\lambda_p \ge \lambda_0'$ and/or $\lambda_1 \le \lambda_0$, where $\lambda_0$ and $\lambda_0'$ are given by
\[ P(\lambda_p \ge \lambda_0') = P(\lambda_1 \le \lambda_0) \tag{8.65} \]
and
\[ 1 - P(\lambda_0 < \lambda_1 \le \lambda_p < \lambda_0') = \text{say, } 0.05, \tag{8.66} \]
the probabilities being calculated on the null hypothesis.


The same type of test may be applied in dispersion
analysis. If the "mean-square" matrices are $S_1$ and $S_2$,
based on k - 1 and n - k degrees of freedom, $S_1$ will be positive
semi-definite and of rank q, say, where q is the smaller of p
and k - 1; while $S_2$ is positive definite. The roots of
(8.63) will have p - q zeros and q non-zeros, and the latter
may be tested in the usual way. Although great progress has
been made with the distribution problem, tables only began to
become available in 1957. (See Foster, F.G., Biometrika, 45.)
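
The roots themselves are readily computed. The following Python/NumPy sketch is an illustration, not Roy's procedure in full: it returns the roots of $|S_1 - \lambda S_2| = 0$ in ascending order, assuming $S_2$ non-singular, and leaves the choice of critical values to the tables. The function name is hypothetical.

```python
import numpy as np

def relative_roots(s1, s2):
    """Roots of |S1 - lambda S2| = 0 in ascending order (S2 non-singular)."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    # The roots are the latent roots of S2^{-1} S1; they are real and
    # non-negative when S1 is positive semi-definite and S2 positive definite.
    roots = np.linalg.eigvals(np.linalg.solve(s2, s1))
    return np.sort(roots.real)

s = np.array([[2.0, 1.0], [1.0, 3.0]])
same = relative_roots(s, s)                        # identical matrices
v = np.array([1.0, 2.0, 2.0])
rank_one = relative_roots(np.outer(v, v), np.eye(3))   # rank 1: p - q zero roots
```

With identical matrices every root is unity, and with a matrix $S_1$ of rank one in three dimensions two of the roots vanish, as the paragraph on dispersion analysis requires.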


9. DISCRIMINATORY ANALYSIS 


9.1 Suppose that we have two p-variate populations of a
similar kind which "overlap" in the sense that certain members
can be observed which might have arisen from either population.
There will be occasions when, confronted with a member and
noting its variate-values, we are uncertain from which population
it emanated. We will suppose that we have to make up
our minds on this question and require to allocate it to one
or the other of the parents. By what rule should we proceed
so as to make as few mistakes as possible over a large number
of similar occasions? Questions of this type give rise to
discriminatory analysis, the general object of which is to find
rules of behaviour, with optimal properties, in the assignment
of individuals to predetermined classes.


Note two things: (a) the classes are predetermined.
It is not the object of the inquiry to find what is the best
way of dividing heterogeneous material into populations or
classes; these are already decided.


(b) We shall, in general, consider the 
assignment of a member to one population or the other: ex- 
tensions of the theory which leave room for suspended judge- 
ment will not be considered, though they are quite possible. 


9.2 We shall consider in the first instance the case where 
there are only two parents; and we shall seek for some func— 
tions of the variates (linear for simplicity if possible) 
which will assume widely different sets of values for the two, so 
that a sample observation can be allotted to the appropriate 
population according to the value which it gives to the function.
We shall then consider the more extended problems
(a) when there are k parents and (b) when we allow ourselves
several discriminant functions, not merely one.


9.3 When we allot a member to one of two populations, $A_1$ and
$A_2$, we may make two kinds of mistakes, according to the population
to which we wrongly assign a given member. We shall
suppose that these two kinds of mistakes are equally important.
(In the contrary case we should have to weight the errors
proportionately to their importance, as in the general theory of
decision functions.) We shall also assume that we require the
two kinds of mistake to occur with equal proportional frequency
for each population, and that their sum should be a minimum.


Consider a space of p dimensions in which a sample member is
represented by a point. The populations may be imagined as
clusters of points (or densities in the continuous case)
condensing round two central values. We wish to set up a
boundary, or rather a region R in the space such that, if
$f_1\,dx$, $f_2\,dx$ represent the elements of frequency of $A_1$ and $A_2$,
\[ \int_R f_2\,dx = \int_{1-R} f_1\,dx = 1 - \int_R f_1\,dx. \tag{9.1} \]
This is equivalent to
\[ \int_R (f_1 + f_2)\,dx = 1. \tag{9.2} \]
We require, subject to this condition, to minimize
\[ \int_R f_2\,dx. \tag{9.3} \]


This is equivalent to finding an unconditional minimum of
\[ \int_R \{f_2 - \lambda (f_1 + f_2)\}\,dx, \]
where $\lambda$ is an undetermined multiplier, or equivalently again, of
\[ \int_R (\beta f_2 - f_1)\,dx, \]
where the constant $\beta$ is determined by (9.2). This is clearly
achieved by taking into R all those points for which $\beta f_2 - f_1$
is negative. Thus the boundary of R is given by
\[ \frac{f_1}{f_2} = \beta. \tag{9.4} \]

9.4 From this point of view the discriminating boundary
arises naturally as determined by the probability ratio. Any
point for which $f_1/f_2 > \beta$ is assigned to $f_1$, and in the
contrary case to $f_2$. The probabilities of misclassification
either way are then equal and minimal; and each is equal to
\[ \int_{f_1/f_2 > \beta} f_2\,dx = \int_{f_1/f_2 < \beta} f_1\,dx. \tag{9.5} \]


Formally, at least, this solves the simplest version of our
problem.


9.5 Now suppose that the two populations are multivariate
normal with means $\mu_{1j}$, $\mu_{2j}$ and identical dispersion matrices
$\alpha_{ij}$. The logarithm of the likelihood ratio, apart from
constants, is
\[ \sum \alpha^{ij} (\mu_{1j} - \mu_{2j}) x_i
   - \tfrac12 \sum \alpha^{ij} (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}), \]
where $\alpha^{ij}$ is the element of the matrix inverse to $\alpha_{ij}$.
The second part of this expression is a constant and without
losing generality we can take as our function
\[ X = \sum \alpha^{ij} (\mu_{1j} - \mu_{2j}) x_i. \tag{9.6} \]


This is the parental form. In practice we usually do not
know the parameters but have to estimate them from a sample
for which the parents are known. Inserting the sample values
in (9.6) (they are maximum-likelihood estimators of the
corresponding parameters) we have for the discriminant function
\[ X = \sum a^{ij} (\bar x_{1j} - \bar x_{2j}) x_i. \tag{9.7} \]
9.6 The same result may be reached by a different route.
Suppose that we determine a linear function
\[ X = \sum_{i=1}^{p} l_i x_i \]
so as to maximize the square of the difference of its expectation
in the two populations, divided by the variance of X
(which, by hypothesis, is common to the two populations).
That is to say, we maximize
\[ \frac{\{\sum_j l_j (\mu_{1j} - \mu_{2j})\}^2}{\sum_{k,m} l_k l_m \alpha_{km}}. \]


A differentiation with respect to $l_i$ gives us
\[ \mu_{1i} - \mu_{2i}
   = \frac{\sum_j l_j (\mu_{1j} - \mu_{2j})}{\sum_{k,m} \alpha_{km} l_k l_m} \sum_j \alpha_{ij} l_j, \]
from which we have
\[ l_i \propto \sum_j \alpha^{ij} (\mu_{1j} - \mu_{2j}), \]
leading back to (9.6). Since our function X is used only to
separate the two populations, not to measure the distance
between them, we may multiply it by any convenient constant.


Example 9.1

(Fisher, 1936)

Measurements were made on 50 flowers from each of two
species of iris, setosa and versicolor. Four measurements
were made, sepal length, sepal width, petal length and petal
width. We denote them by the suffixes 1, 2, 3, 4.


The sums of squares and products about the means were (in cm.²)


          1          2          3          4
1      19.1434     9.0356     9.7634     3.2394
2                 11.8658     4.6232     2.4746
3                            12.2978     3.8794
4                                        2.4604
                                                 (9.8)
The means were
        Versicolor     Setosa      V - S
1         5.936        5.006       0.930
2         2.770        3.428      -0.658
3         4.260        1.462       2.798
4         1.326        0.246       1.080
                                                 (9.9)
The matrix inverse to (9.8) is (cm.⁻²)
            1              2              3              4
1      .118,716,1    -.066,866,6    -.081,615,8     .039,635,0
2                     .145,273,6     .033,410,1    -.110,752,9
3                                    .219,361,4    -.272,020,6
4                                                   .894,550,6
                                                        (9.9a)
Using (9.7) we then find, for the coefficients
\[ l_1 = (0.118{,}716{,}1)(0.930) - (0.066{,}866{,}6)(-0.658)
       - (0.081{,}615{,}8)(2.798) + (0.039{,}635{,}0)(1.080) = -0.031{,}151{,}1, \]
\[ l_2 = -0.183{,}907{,}5, \qquad l_3 = 0.222{,}104{,}4, \qquad l_4 = 0.314{,}737{,}0. \]


We may conveniently choose multiples of these coefficients so
that the coefficient of $x_1$ is unity. We then find
\[ X = x_1 + 5.9037 x_2 - 7.1299 x_3 - 10.1036 x_4. \tag{9.10} \]
The mean value of X for versicolor, obtained by substituting
from (9.9) in this, is -21.4815; and the mean for setosa is
12.3345. The difference is 33.8160 cm.
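
The coefficients can be recomputed directly from (9.8) and (9.9). The sketch below, in Python with NumPy, is an illustration only (the function name is hypothetical): it solves the linear equations rather than forming the inverse (9.9a), and then rescales so that the coefficient of $x_1$ is unity, as in (9.10).

```python
import numpy as np

def discriminant_coefficients(ssp, mean_a, mean_b):
    """Solve ssp * l = (mean_a - mean_b) for the discriminant coefficients l."""
    diff = np.asarray(mean_a, dtype=float) - np.asarray(mean_b, dtype=float)
    return np.linalg.solve(np.asarray(ssp, dtype=float), diff)

# Sums of squares and products about the means, (9.8).
ssp = [[19.1434, 9.0356, 9.7634, 3.2394],
       [9.0356, 11.8658, 4.6232, 2.4746],
       [9.7634, 4.6232, 12.2978, 3.8794],
       [3.2394, 2.4746, 3.8794, 2.4604]]
versicolor = np.array([5.936, 2.770, 4.260, 1.326])
setosa = np.array([5.006, 3.428, 1.462, 0.246])

l = discriminant_coefficients(ssp, versicolor, setosa)
scaled = l / l[0]                       # coefficient of x1 made unity, as in (9.10)
mean_gap = scaled @ (versicolor - setosa)   # difference of the two mean values of X
```

Since $l_1$ is negative, the rescaled function takes its larger values for setosa, which is why the mean for versicolor in (9.10) is the negative one.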


From (9.6) we have, relating X to the original constants,
\[ X = \sum \alpha^{ij} (\mu_{1j} - \mu_{2j}) x_i = \sum l_i x_i. \]
Then
\[ \operatorname{var} X = \sum l_i l_j \operatorname{cov}(x_i, x_j)
   = \sum l_i l_j \alpha_{ij}
   = \sum_i l_i \sum_j \alpha_{ij} \sum_k \alpha^{jk} (\mu_{1k} - \mu_{2k})
   = \sum l_i (\mu_{1i} - \mu_{2i})
   = \bar X_1 - \bar X_2. \]
We may then estimate var X from the difference of the observed
mean values. In our present case the variance of X of (9.10)
is given by
\[ \frac{33.8160}{n\,(0.031{,}151{,}1)}, \]
where n is the number of degrees of freedom of the estimate
and is 98 (the divisor 0.031,151,1 arises because (9.10) is the
original function divided by $l_1$). Thus var X = 11.08. This is
the variance of a single value. The variance of the difference
of two means, each of 50 numbers, is 1/25th of this, namely
0.4432, giving a standard error of 0.664. The observed difference
of means, 33.816, is more than 50 times as great as the standard
error and we conclude that the discriminant is likely to be
effective.


The probability of misclassification is easily estimated.
It is the integral of the multivariate form over a region to
one side of the plane X = constant, which is easily seen to be
the tail of a normal distribution function of X. The standard
deviation of X is $\sqrt{11.08} = 3.33$ and the distance between the
parent means in the X-direction is 33.816. One half of this
is 16.9 and we require the tail area of a normal error in excess
of 16.9 / 3.33 standard deviations from the mean. This
is negligible.
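
How negligible may be seen from a short calculation. The Python sketch below is illustrative (function names are ours); it evaluates the upper tail of the normal distribution by means of the complementary error function and applies it to half the distance between the means, measured in standard deviations.

```python
import math

def normal_upper_tail(z):
    """Area of the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def misclassification_probability(mean_difference, sd):
    """Tail area beyond half the distance between the means, in units of sd."""
    return normal_upper_tail(0.5 * mean_difference / sd)

# Iris discriminant (9.10): difference of means 33.816, var X = 11.08.
p_error = misclassification_probability(33.816, math.sqrt(11.08))
```

The deviate is rather more than five standard deviations, so the estimated chance of misclassification is well below one in a million.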


Example 9.2 


(F. Heincke, 1898, Naturgeschichte des Herings, Berlin;
S. R. Zarapkin, 1934, Arch. Naturgesch. 3, 161; G.
Beall, 1945, Psychometrika, 10, 205; L. S. Penrose, 1947)


Biometricians have often proposed to discriminate between
individuals on the basis of "size" and "shape". Consider the case
where measurements are made on an organism and the correlations
between them are positive and all equal, say, to r. The
correlation matrix is then the one that we considered in
Example 2.2 and has latent roots $\lambda_1 = 1 + (p-1)r$,
$\lambda_2 = \dots = \lambda_p = 1 - r$. From the component analysis approach this
means that there is one principal direction and that the others
are isotropic. The component corresponding to $\lambda_1$ is
\[ \xi = \frac{1}{\sqrt{p}} \sum x_i. \tag{9.11} \]


We take a "size" component proportional to this and write
\[ Q = \sum_{i=1}^{p} x_i, \tag{9.12} \]
so that
\[ \operatorname{var} Q = p\lambda_1 = p\{1 + (p-1)r\}. \tag{9.13} \]
Among the remaining variation no particular direction is
suggested as suitable. Let us take a set of weights $w_i$ with
non-zero mean $\bar w$ and define a shape component by
\[ P = \sum_{i=1}^{p} \frac{w_i - \bar w}{\bar w} x_i. \tag{9.14} \]


We have at once
\[ \operatorname{var} P = (1 - r) \sum \left( \frac{w_i - \bar w}{\bar w} \right)^2, \tag{9.15} \]
where the x's are taken to have standard measure.


Also
\[ \operatorname{cov}(Q, P)
   = \operatorname{cov}\left\{ \sum_i x_i, \; \sum_j \frac{w_j - \bar w}{\bar w} x_j \right\}
   = \sum_j \frac{w_j - \bar w}{\bar w} \sum_i \operatorname{cov}(x_i, x_j)
   = \{1 + (p-1)r\} \sum_j \frac{w_j - \bar w}{\bar w}
   = 0. \tag{9.16} \]


The shape component is then uncorrelated with the size component
and this will remain approximately true if the correlations
are nearly equal but not exactly so.


When we are interested in discrimination we may choose
the weights so as best to discriminate between two populations
and hence arrive at an ad hoc measure of "shape". We shall
take
\[ w_i = \bar x_{1i} - \bar x_{2i} \tag{9.17} \]
and shall look for a linear function of size and shape which
is the best discriminator:
\[ X = \alpha Q + P. \tag{9.18} \]
This will be given by determining $\alpha$ so as to maximize
\[ \frac{(\bar X_1 - \bar X_2)^2}{\operatorname{var} X}. \]


If $D_P = \bar P_1 - \bar P_2$ and $D_Q = \bar Q_1 - \bar Q_2$, the suffixes as usual
referring to the two populations, this requires the maximization
of
\[ \frac{(\alpha D_Q + D_P)^2}{\alpha^2 \operatorname{var} Q + 2\alpha \operatorname{cov}(Q, P) + \operatorname{var} P}, \]
leading to
\[ \alpha = \frac{D_Q \operatorname{var} P - D_P \operatorname{cov}(Q, P)}{D_P \operatorname{var} Q - D_Q \operatorname{cov}(Q, P)}. \tag{9.19} \]
But Q and P are uncorrelated. We have also, writing $\bar x_1 - \bar x_2$
for the mean $\bar w$ of the differences (9.17),
\[ D_P = \sum \frac{\bar x_{1i} - \bar x_{2i} - (\bar x_1 - \bar x_2)}{\bar x_1 - \bar x_2} (\bar x_{1i} - \bar x_{2i})
       = \sum \frac{(\bar x_{1i} - \bar x_{2i} - \bar x_1 + \bar x_2)^2}{\bar x_1 - \bar x_2}, \]
\[ \operatorname{var} P = (1 - r) \sum \frac{(\bar x_{1i} - \bar x_{2i} - \bar x_1 + \bar x_2)^2}{(\bar x_1 - \bar x_2)^2}, \]
\[ D_Q = p(\bar x_1 - \bar x_2), \qquad \operatorname{var} Q = p\{1 + (p-1)r\}. \]


On substitution for $\alpha$ in (9.19) we then find
\[ \alpha = \frac{1 - r}{1 + (p-1)r}. \tag{9.20} \]


This, in the sense we have defined, is the "best" linear
function of size and shape.


Reverting to the Iris data of Example 9.1 we have the
following values of the mean when expressed in terms of common
standard deviation and reduced to common zero means:


           Versicolor     Setosa       V - S

1            1.0628      -1.0628      2.1256
2           -0.9551       0.9551     -1.9102
3            3.9894      -3.9894      7.9788
4            3.4426      -3.4426      6.8852

Sum = Q      7.5397      -7.5397      D_Q = 15.0794
Var Q       10.5076      10.5076                      (9.21)


The estimate of size, Q, is simply the sum of the standardized
variates for each variety. The variance of Q is calculated
from the dispersion matrix reduced by standardization
to a correlation matrix:


          1           2           3           4
1         1       .599,513    .636,323    .472,011
2                     1       .382,719    .457,988
3                                 1       .705,258
4                                             1
                                                      (9.22)
Thus var Q is the sum of the 16 elements in this matrix.

The weightings for shape are found from (9.21) by dividing
the figures in the last column by $\tfrac14 (15.0794)$ and subtracting
unity, and are -0.4362, -1.5067, 1.1165, 0.8264. For
the estimate of shape we then find, for versicolor,
\[ P = (-0.4362 \times 1.0628) + \text{etc.} = 8.2747, \]
so that $D_P = 16.5494$, and by using (9.22) again we find
\[ \operatorname{var} P = 3.0912. \]


The covariance of Q and P does not exactly vanish and we calculate
it from (9.22) as 0.36162. Thus, substituting in
(9.19) we find for $\alpha$
\[ \alpha = \frac{(15.0794)(3.0912) - (16.5494)(0.36162)}{(16.5494)(10.5076) - (15.0794)(0.36162)} = 0.2412. \]
Hence the discriminator is
\[ X = 0.2412\, Q + P. \tag{9.23} \]
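
The whole of this calculation may be retraced from the last column of (9.21) and the correlation matrix (9.22). The following Python/NumPy sketch is an illustration under those inputs; the function and variable names are hypothetical, and the shape weights are carried unrounded, so that the results agree with the text only to about three decimal places.

```python
import numpy as np

def size_shape_alpha(corr, diff):
    """Return alpha of (9.19) with the quantities entering it.

    corr: correlation matrix of the standardized variates
    diff: differences of the standardized means, w_i of (9.17)
    """
    corr = np.asarray(corr, dtype=float)
    diff = np.asarray(diff, dtype=float)
    ones = np.ones(len(diff))
    c = diff / diff.mean() - 1.0        # shape weights (w_i - w_bar) / w_bar
    d_q, d_p = diff.sum(), c @ diff     # differences of mean size and mean shape
    var_q = ones @ corr @ ones          # sum of all elements of corr
    var_p = c @ corr @ c
    cov_qp = ones @ corr @ c
    alpha = (d_q * var_p - d_p * cov_qp) / (d_p * var_q - d_q * cov_qp)
    return alpha, d_q, d_p, var_q, var_p, cov_qp

corr = [[1.0, 0.599513, 0.636323, 0.472011],
        [0.599513, 1.0, 0.382719, 0.457988],
        [0.636323, 0.382719, 1.0, 0.705258],
        [0.472011, 0.457988, 0.705258, 1.0]]
diff = [2.1256, -1.9102, 7.9788, 6.8852]    # last column of (9.21)

alpha, d_q, d_p, var_q, var_p, cov_qp = size_shape_alpha(corr, diff)
```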


Example 9.3
(C. A. B. Smith, 1947)


The method of the foregoing example is useful in reducing
the discriminant to a function of two variables only; and if
we work on size and shape variables P and Q we can handle
quadratic discriminators without too much mathematical
complication.

A group of 25 normal and 25 psychotic individuals were
given certain tests, and for each individual a size and shape
variable x and y were determined. The results were:

             25 Normals                  25 Psychotics
           x      y      z             x      y      z
          22      6     62            24     38      8
          20     14     36            19     36    -13
          23      9     61            11     43    -67
          23      1     77             6     60   -126
          17      8     33             9     32    -55
          24      9     66            10     17    -20
          23     13     53             3     17    -55
          18     18     18            15     56    -73
          22     16     42            14     43    -52
          19     18     23            20      8     48
          20     17     30             8     46    -88
          20     31      2            20     62    -60
          21      9     51            14     36    -38
          13     13      3             3     12    -45
          20     14     36            10     51    -88
          19     15     29            22     22     30
          20     11     42            11     30    -41
          18     17     20             6     30    -66
          20      7     50            20     61    -58
          23      6     67            20     43    -22
          23     23     33            15     48    -57
          23      4     71             5     53   -117
          23      5     69            10     43    -72
          21     12     45            13     19     -9
          23      7     65            12      4     16
Totals   520    308   1084           320    910  -1120
Mean    20.8  12.32   43.4         12.80  36.40  -44.8
                                                     (9.24)


Here z is a quantity 5x - 2y - 36 to be explained later.
We have the following quantities:


                 Normal     Psychotics

Mean of x        20.80        12.80
Mean of y        12.32        36.40
Var x             6.92        36.75
Var y            40.89       287.92
Cov (x,y)         5.27        13.92
d.f.             24           24
S.d. of x         2.63         6.06
S.d. of y         6.39        16.97
Correlation       0.31         0.14          (9.25)


Let us consider first of all a linear discriminant function
of x and y. Pooling the dispersions we get for the supposed
dispersion matrix


          x          y
x       21.83       4.33
y                 164.40       (9.26)


and for the difference of means (Normal - Psychotics):


x      8.00
y    -24.08.


The inverse of (9.26) is proportional to


  164.40     -4.33
   -4.33     21.83

The discriminant, from (9.7), is then
\[ (164.40 \times 8.00 + 4.33 \times 24.08)\{x - \tfrac12(20.80 + 12.80)\}
   + (-4.33 \times 8.00 - 21.83 \times 24.08)\{y - \tfrac12(12.32 + 36.40)\}
   = 1419x - 560y - 10{,}198; \]
on division by 280 this becomes, nearly enough,
\[ z = 5x - 2y - 36. \tag{9.27} \]


The values of this function are shown in (9.24). It is seen
that z is positive for all normals (no errors) and negative
for all psychotics except four (16% error). The errors of
classification, as estimated from the data themselves, are
accordingly not symmetrical and amount to 8% over all. This
is better than we should do by using x or y alone; for instance,
if we take $x \ge 17$ to be normal and $x \le 16$ to be
psychotic there would be 1 error in classifying the normals
and 6 for the psychotics.


We may remark, however, that the variances and covariances
of x and y are very different for the two types; and we doubt
whether it is legitimate to suppose that they have a common
dispersion matrix. If we go back to the approach of 9.5,
but assume different dispersion matrices $\alpha$ and $\beta$, say, the
logarithm of the likelihood ratio becomes proportional to
\[ \sum \alpha^{ij} (x_i - \mu_{1i})(x_j - \mu_{1j}) - \sum \beta^{ij} (x_i - \mu_{2i})(x_j - \mu_{2j}), \tag{9.28} \]
which is a quadratic function. We now use the fact that size
and shape have a correlation which can be put equal to zero.
The expression (9.28) then reduces to a form of type
\[ (\xi - k_1)^2 + (\eta - k_2)^2, \tag{9.29} \]
where
\[ \xi = x \sqrt{(\alpha^{11} - \beta^{11})}, \text{ etc.} \]
In our present case we have for the estimates of $\alpha$, $\beta$
\[ \alpha^{11} = \frac{1}{6.92} = 0.1445, \quad \alpha^{22} = 0.0244, \quad
   \beta^{11} = 0.0272, \quad \beta^{22} = 0.0035, \]
and the discriminant function becomes
\[ -0.1173 (x - 22.65)^2 - 0.0209 (y - 8.29)^2 + 8.16. \tag{9.30} \]


On multiplication by -2 this becomes, nearly enough,
\[ \left( \frac{x - 23}{2} \right)^2 + \left( \frac{y - 8}{5} \right)^2 - 16. \tag{9.31} \]


The values given to the 25 normals by this function are -16,
-12, -16, -14, -7, -16, -15, -6, -13, -8, -10, 7, -15, 10, -12,
-10, -13, -7, -14, -16, -7, -15, -16, -14, -16
(2 positive, error = 8%).

The values given to the psychotics are 20, 19, 69, 164, 56,
29, 87, 92, 53, -14, 95, 103, 36, 85, 100, -8, 39, 76, 99,
41, 64, 146, 75, 14, 15 (2 negative, error = 8%).
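
The two rules of this example are simple enough to be tried on individual members. The Python sketch below is illustrative only: the function names are ours, the quadratic form is (9.31) in its rounded version, and the pairs (x, y) used are taken from the first rows of the table of results.

```python
def linear_z(x, y):
    """Linear discriminator (9.27): positive values are allotted to the normals."""
    return 5 * x - 2 * y - 36

def quadratic_q(x, y):
    """Quadratic discriminator (9.31): negative values are allotted to the normals."""
    return ((x - 23) / 2) ** 2 + ((y - 8) / 5) ** 2 - 16

first_normal = (22, 6)
first_psychotic = (24, 38)
z_values = (linear_z(*first_normal), linear_z(*first_psychotic))
q_values = (quadratic_q(*first_normal), quadratic_q(*first_psychotic))
```

The first psychotic, with x = 24 and y = 38, is one of the four wrongly classed by the linear rule (z = 8) but is correctly classed by the quadratic one, which gives about 20.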


It is instructive to plot the data (as Smith does) on a 
diagram and to examine how the points lie in relation to the 
discriminatory lines. 


The significance of a discriminant function 


9.7 We may ask whether a discriminator is "significant".
Such a question needs a little clarification. We may mean
that there is a real difference between the populations but
that they are so close together that a discriminator is not
very effective; this is measured by the errors of
misclassification which, though minimal, may still be large.
Or we may mean that there is a real difference between the
populations but our sample size is not large enough to produce
a very reliable discriminator; this is really a matter
of setting confidence intervals to the function or its
coefficients. Or we may mean that the parents are identical and
that a discriminant function is illusory.


9.8 Questions of "significance" in discriminant functions have
usually been discussed in terms of the last possibility. They
are not so much tests of the functions as tests of homogeneity
by the use of the functions. If heterogeneity is found the
function, ipso facto, is significant in the sense that it
discriminates between real differences in an optimal way (except
that we use estimators of dispersions and means instead of the
unknown parent values). But that way may not be very good
even if it is the best.


9.9 Suppose our two populations have, in fact, identical means.
The difference of the means in the discriminator is then
\[ U, \text{ say}, = \sum a^{ij} (\bar x_{1i} - \bar x_{2i})(\bar x_{1j} - \bar x_{2j}). \tag{9.32} \]


The term $\bar x_{1i} - \bar x_{2i}$ is the difference of two means, each
normally distributed, and is therefore distributed like a mean
about zero with twice the variance of a single mean if $n_1 = n_2$.
It follows that U is distributed as Hotelling's $T^2/(2n - 1)$,
based on 2n observations. This is equivalent to the distribution
of multiple R in the null case and the test can be carried out
by an analysis of variance.


9.10 The same conclusion is reached from the following approach.
We now generalize slightly to the case where there are $n_1$
members observed in one class and $n_2$ in the other. It is
readily verified from 9.5 that this does not affect the form
(9.7) for the discriminant function. We introduce a dummy
for a dependent variate y on a pseudo-regression equation
\[ y = \sum l_i (x_i - \bar x_i) \tag{9.33} \]
by putting
\[ y = \frac{n_2}{n_1 + n_2} \quad \text{or} \quad -\frac{n_1}{n_1 + n_2} \tag{9.34} \]
according as the member falls in the first or second class.
The mean of y is zero and in (9.33) we take $\bar x_i$ as the mean of
$x_i$ over both classes, so that
\[ \bar x_i = \frac{n_1 \bar x_{1i} + n_2 \bar x_{2i}}{n_1 + n_2}. \tag{9.35} \]
Treating (9.33) as a regression we find
\[ S\{y (x_i - \bar x_i)\} = \sum_j l_j \, S\{(x_i - \bar x_i)(x_j - \bar x_j)\}, \]
where summation S refers to sample values.


Now from (9.34)
\[ S\{y (x_i - \bar x_i)\}
   = \frac{n_2}{n_1 + n_2} n_1 (\bar x_{1i} - \bar x_i) - \frac{n_1}{n_1 + n_2} n_2 (\bar x_{2i} - \bar x_i)
   = \frac{n_1 n_2}{n_1 + n_2} (\bar x_{1i} - \bar x_{2i}). \tag{9.36} \]
We also have
\[ S(x_i - \bar x_i)(x_j - \bar x_j)
   = S_1 (x_i - \bar x_{1i})(x_j - \bar x_{1j}) + S_2 (x_i - \bar x_{2i})(x_j - \bar x_{2j})
   + n_1 (\bar x_{1i} - \bar x_i)(\bar x_{1j} - \bar x_j) + n_2 (\bar x_{2i} - \bar x_i)(\bar x_{2j} - \bar x_j), \]
where $S_1$, $S_2$ refer to summation over the first and second
groups; and by use of (9.35) this is reduced to
\[ (n_1 + n_2) a_{ij} + \frac{n_1 n_2}{n_1 + n_2} (\bar x_{1i} - \bar x_{2i})(\bar x_{1j} - \bar x_{2j}). \tag{9.37} \]
Hence we find, putting
\[ K = \sum l_j (\bar x_{1j} - \bar x_{2j}), \tag{9.37a} \]
\[ \frac{n_1 n_2}{n_1 + n_2} (1 - K)(\bar x_{1i} - \bar x_{2i}) = (n_1 + n_2) \sum_j a_{ij} l_j, \]
giving
\[ l_i = \frac{n_1 n_2}{(n_1 + n_2)^2} (1 - K) \sum_j a^{ij} (\bar x_{1j} - \bar x_{2j}), \tag{9.38} \]
leading once again to the discriminant function. The constant
K is given by putting these values in (9.37a) as
\[ K = \frac{n_1 n_2}{(n_1 + n_2)^2} (1 - K) \sum a^{ij} (\bar x_{1i} - \bar x_{2i})(\bar x_{1j} - \bar x_{2j}). \tag{9.39} \]
We also find
\[ S(y^2) = \frac{n_1 n_2}{n_1 + n_2}. \tag{9.40} \]
The regression analysis may then be written

                 Sum of squares
             (n1 n2/(n1 + n2) times)              d.f.

Regression    Σ l_i (x̄_1i - x̄_2i)                 p
Residual      1 - Σ l_i (x̄_1i - x̄_2i)             n1 + n2 - p - 1
Total         1                                   n1 + n2 - 1
                                                           (9.41)

Now it does not follow immediately from ordinary regression
theory that this can be tested as a variance analysis. We
have put dummy variates for our dependent variable and the
ordinary theory applies to models where the dependent variable
is a random variate and the independent variables may be fixed
(and in particular, dummies). It is, however, a remarkable
fact that the test still applies as a test of homogeneity.

The basic reason is a kind of duality which exists in the
sample space. In ordinary regression theory for the null
case (no parental multiple correlation) we find the distribution
of the angle between a random (dependent) vector and a
fixed (independent) plane. On the present occasion we require
the distribution of the angle between the fixed (dependent)
vector and the random (independent) plane. And the two
amount to the same thing. When we leave the null case this
duality, in general, breaks down.


Example 9.4


Let us revert to the Iris data of Examples 9.1 and 9.2.
We found values of the l's in the former, but in applying (9.41)
we have to be careful about coefficients. Those l's were
found from the deviance (not the dispersion matrix). Using
them and (9.7) we find
\[ \sum l_i (\bar x_{1i} - \bar x_{2i}) = (-0.031{,}151{,}1)(0.930) + (-0.183{,}907{,}5)(-0.658)
   + (0.222{,}104{,}4)(2.798) + (0.314{,}737{,}0)(1.080) = 1.053{,}404{,}68. \]
Since $n_1 = n_2 = 50$ we have for K, from (9.39),
\[ K = \frac{1.053{,}404{,}68}{1.053{,}404{,}68 + 1/25} = 0.963{,}417. \]
(With the l's computed from the deviance, (9.39) reads
$K = 25 (1 - K) \times 1.053{,}404{,}68$.)


Thus the analysis is


                S.S. (25 times)     d.f.     Quotient

"Regression"      0.963,417           4      0.240,854
Residual          0.036,573          95      0.000,385
Total             1.000,000          99
                                                      (9.42)
and the "regression" is overwhelmingly significant.


We may, if we wish, test the significance of individual
coefficients in the discriminant functions by regarding them
as regression coefficients. We use the matrix inverse to (9.8),
namely (9.9a). For example the value of $l_1$ in Example 9.1 is
-0.031,151,1. We standardize it by multiplying by
\[ \frac{n_1 n_2}{n_1 + n_2} (1 - K), \]
namely 0.914,325, to obtain -0.028,482. In (9.9a) the term
corresponding to $l_1$ is 0.118,716,1, which must
be multiplied by the residual quotient in (9.42), namely
0.000,384,979 × 25, to give an estimate of the variance of the
coefficient as 0.001,142,580 with a standard error of 0.0336.
The actual value is rather less than this and we are led to
doubt whether $x_1$ is playing any important part in the
discrimination.
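
The pseudo-regression analysis of this example may be retraced as follows. The Python sketch is an illustration, not part of the original notes; its names are hypothetical. It starts from the coefficients l of Example 9.1 (found from the deviance) and the differences of means, obtains K from the relation $K = c(1 - K) \sum l_i d_i$ with $c = n_1 n_2 / (n_1 + n_2)$, and forms the variance ratio of the analysis.

```python
def pseudo_regression(l, d, n1, n2):
    """Return (sum of l_i d_i, K, variance ratio) for the analysis (9.41)."""
    p = len(l)
    c = n1 * n2 / (n1 + n2)
    total = sum(li * di for li, di in zip(l, d))
    k = c * total / (1.0 + c * total)       # solves K = c (1 - K) * total
    residual_df = n1 + n2 - p - 1
    ratio = (k / p) / ((1.0 - k) / residual_df)
    return total, k, ratio

l = [-0.0311511, -0.1839075, 0.2221044, 0.3147370]   # coefficients from Example 9.1
d = [0.930, -0.658, 2.798, 1.080]                     # versicolor minus setosa
total, k, ratio = pseudo_regression(l, d, 50, 50)
```

The variance ratio, on 4 and 95 degrees of freedom, is of the order of several hundreds, which is the sense in which the "regression" is overwhelmingly significant.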


We may remark that in this example, although the value of
the discriminant function would be slightly impaired if we
discarded any variate, except perhaps $x_4$, we can get a
discriminant which is quite good enough for ordinary purposes
by using $x_3$, petal length, alone. The variance of $x_3$, from
(9.8), is estimated as 12.2978/98 = 0.125,488. The mean
difference (of two sets of 50) has then a variance of 0.005,019,52
with a standard error of 0.0785. On this scale the discriminant
function would be $x_3$ itself. The mean difference in two
sets of 50 is then 2.798, about 36 times the standard error;
not so big as the factor of 50 for the discriminant function
based on four variates but big enough for practical purposes.
The error of misclassification is about equal to the tail
area of a normal curve to the right of an ordinate four standard
deviations to the right of the mean.

The case of k populations 


en two populations 


9.11 When we proceed from discrimination betwe : 
s an essentially 


to discrimination among a member of population: nta 
new point appears. As before, we shall endeavour to divide 
up the sample space into mutually exclusive regions, one for 
for each population, and allot an observed member to the pop- 
ulation in whose region it falls. But the boundaries of the 
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regions are no longer determined by one single discriminant 
function. Hither we must, to achieve optimal properties, 
have several functions, or, if we must have a single function, 
we shall have to sacrifice some discriminatory power. 


9.12 It will be enough for expository purposes if we consider
three populations; the generalization to k is immediate.
We will also generalize to the extent of supposing that the
probabilities of occurrence of the three populations, whose
density functions are $f_1$, $f_2$, $f_3$, are respectively
$\pi_1$, $\pi_2$, $\pi_3$ ($\pi_1 + \pi_2 + \pi_3 = 1$). If the
corresponding regions are $R_1$, $R_2$, $R_3$, a generalization
by Rao of a lemma due in essence to Neyman and Pearson states
that the errors of misclassification are a minimum if the
regions are determined by probability ratios which form a
simple extension of 9.3. In fact $R_1$ is such that $\pi_1 f_1$
is greater than or equal to both $\pi_2 f_2$ and $\pi_3 f_3$;
$R_2$ is such that $\pi_2 f_2 \geq \pi_3 f_3$ and $\pi_1 f_1$;
$R_3$ is such that $\pi_3 f_3 \geq \pi_1 f_1$ and $\pi_2 f_2$.
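The rule of 9.12 amounts to allotting a member to the population for
which the weighted density is greatest. A minimal sketch in Python
follows; the three univariate normal populations and their
probabilities of occurrence are hypothetical, chosen only to
illustrate the rule, and the function names are not from the notes.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def allocate(x, priors, densities):
    """Index i for which priors[i] * densities[i](x) is greatest."""
    weighted = [p * f(x) for p, f in zip(priors, densities)]
    return max(range(len(weighted)), key=weighted.__getitem__)

# Hypothetical illustration: three normal populations with unit variance.
priors = [0.5, 0.3, 0.2]
densities = [
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 2.0, 1.0),
    lambda x: normal_pdf(x, 4.0, 1.0),
]

# Allot three observed values to regions (indices 0, 1, 2).
regions = [allocate(x, priors, densities) for x in (-1.0, 2.0, 5.0)]
print(regions)
```

In case of a tie the sketch allots to the first population listed;
the choice is immaterial to the error of misclassification.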


9.13 In particular, if the three populations are normal with
common dispersion matrix $(\alpha_{ij})$ and means $\mu_{1j}$,
$\mu_{2j}$, $\mu_{3j}$, it follows, in the manner of the
two-population case, that $R_1$ must be such that

\[ \sum \alpha^{ij} (\mu_{1i} - \mu_{2i}) x_j \geq \text{some constant } \beta_{12}, \text{ say}; \tag{9.43} \]

\[ \sum \alpha^{ij} (\mu_{1i} - \mu_{3i}) x_j \geq \text{some constant } \beta_{13}, \text{ say}. \tag{9.44} \]

Similarly for the other regions. In the sample space $R_1$ will be
determined as the domain lying between the two hyperplanes
(9.43) and (9.44) and including the mean of population 1; and
so on. The surfaces of constant weighted probability ratio
for populations 1 and 2 are, in fact, given by

\[ \log \frac{\pi_1 f_1}{\pi_2 f_2} = \sum \alpha^{ij} (\mu_{1i} - \mu_{2i}) x_j - \tfrac{1}{2} \sum \alpha^{ij} (\mu_{1i} \mu_{1j} - \mu_{2i} \mu_{2j}) + \log (\pi_1 / \pi_2). \tag{9.45} \]


In the particular case where all the $\pi$'s are equal we may
compare the three functions

\[ X_1 = \sum \alpha^{ij} \mu_{1i} x_j - \tfrac{1}{2} \sum \alpha^{ij} \mu_{1i} \mu_{1j} \tag{9.46} \]

\[ X_2 = \sum \alpha^{ij} \mu_{2i} x_j - \tfrac{1}{2} \sum \alpha^{ij} \mu_{2i} \mu_{2j} \tag{9.47} \]

\[ X_3 = \sum \alpha^{ij} \mu_{3i} x_j - \tfrac{1}{2} \sum \alpha^{ij} \mu_{3i} \mu_{3j} \tag{9.48} \]

and allot a member to $R_1$, $R_2$, $R_3$ according to which of the
$X$'s is the greatest when the sample values are substituted. For
if, say, $X_1$ is the greatest, it follows from (9.45) that
$f_1 \geq f_2$ and $f_1 \geq f_3$. As usual we may substitute sample
values for the unknown parameters in these equations to get an
approximate discriminator.
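The step from (9.45) to the comparison of the functions (9.46) to
(9.48) may be spelled out as follows. This is a sketch only, taking
$\alpha^{ij}$ to be the elements of the inverse of the dispersion
matrix and all the $\pi$'s to be equal.

```latex
\begin{align*}
X_1 - X_2
  &= \sum \alpha^{ij} (\mu_{1i} - \mu_{2i}) x_j
     - \tfrac{1}{2} \sum \alpha^{ij}
       (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}) \\
  &= \log \frac{f_1}{f_2} \qquad (\pi_1 = \pi_2),
\end{align*}
```

so that $X_1 \geq X_2$ exactly when $f_1 \geq f_2$, and similarly for
the other pairs of populations.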


Example 9.5 
(Rao and Slater, 1949, Brit. Jour. Psych. 2, 17) 


A number of persons falling into certain neurotic groups
obtained the following mean scores in three tests $x_1$, $x_2$, $x_3$.

                        Sample            Mean Score
Group                    Size      $x_1$     $x_2$     $x_3$
Anxiety state             114     2.9298    1.1667    0.7281
Hysteria                   33     3.0303    1.2424    0.5455
Psychopathy                32     3.8125    1.8438    0.8125
Obsession                  17     4.7059    1.5882    1.1176
Personality change          5     1.4000    0.2000    0.0000
Normal                     55     0.6000    0.1455    0.2182
Total                     256                                  (9.49)

The dispersion matrix within groups (250 d.f.) was

          1             2             3
1     2.300,851     0.261,578     0.474,169
2                   0.607,466     0.035,774
3                                 0.595,094      (9.50)

Its inverse is

          1             2             3
1     0.543,234    -0.200,195    -0.420,813
2                   1.725,807     0.055,767
3                                 2.012,357      (9.51)


For the purposes of this example I will suppose all the
$\pi$'s to be equal. The functions of type (9.46) then are as
follows:

                               Coefficients
                        $x_1$     $x_2$     $x_3$    Constant
Normal                 0.2050    0.1431    0.1947    -0.0931
Personality change     0.7204    0.0649   -0.5780    -0.5107
Anxiety state          1.0515    1.4676    0.2974    -2.5047
Hysteria               1.1678    1.5679   -0.1081    -2.7139
Psychopathy            1.3599    2.4641    0.1836    -4.9182
Obsession              1.7680    1.8611    0.3573    -5.8375
                                                       (9.52)


Here, for example, the coefficient of $x_1$ for the normal state
is

(0.543,234)(0.6000) - (0.200,195)(0.1455) + (-0.420,813)(0.2182)

= 0.2050.

Suppose, for example, we had a subject with scores 1, 1,
0. The values of the functions, in the order of (9.52), are
0.2550, 0.2746, 0.0144, 0.0218, -1.0942, -2.2084. We assign
the member to the second group, personality change. In ...
the personality-change group on which the sample discriminators
are based.
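The arithmetic of this example can be checked mechanically. The
following Python sketch forms the functions of type (9.46) from the
inverse matrix (9.51) and the group means (9.49), and allots the
subject with scores 1, 1, 0. The function names are illustrative
only, the second Hysteria mean is read as 1.2424, and the inverse is
used as printed rather than recomputed from (9.50).

```python
# Inverse of the within-group dispersion matrix, as printed in (9.51).
INV = [
    [0.543234, -0.200195, -0.420813],
    [-0.200195, 1.725807, 0.055767],
    [-0.420813, 0.055767, 2.012357],
]

# Group mean scores from (9.49), in the order used in (9.52).
MEANS = {
    "Normal": [0.6000, 0.1455, 0.2182],
    "Personality change": [1.4000, 0.2000, 0.0000],
    "Anxiety state": [2.9298, 1.1667, 0.7281],
    "Hysteria": [3.0303, 1.2424, 0.5455],
    "Psychopathy": [3.8125, 1.8438, 0.8125],
    "Obsession": [4.7059, 1.5882, 1.1176],
}

def discriminant(mean):
    """Coefficients and constant of a function of type (9.46)."""
    coeffs = [sum(INV[i][j] * mean[i] for i in range(3)) for j in range(3)]
    constant = -0.5 * sum(c * m for c, m in zip(coeffs, mean))
    return coeffs, constant

def scores(x):
    """Value of each group's function at the score vector x."""
    out = {}
    for group, mean in MEANS.items():
        coeffs, constant = discriminant(mean)
        out[group] = sum(c * xi for c, xi in zip(coeffs, x)) + constant
    return out

def allocate(x):
    """Group whose function is greatest at x."""
    s = scores(x)
    return max(s, key=s.get)

for group, mean in MEANS.items():
    coeffs, constant = discriminant(mean)
    print(group, [round(c, 4) for c in coeffs], round(constant, 4))

print(allocate([1, 1, 0]))
```

For the subject with scores 1, 1, 0 the sketch reproduces the
allocation to the personality-change group, and the row for the
normal state agrees with the values quoted above to within rounding.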


9.14 Consider again the geometrical representation of the
situation in a p-way space. The k populations, assumed with
identical dispersions, are centered at k points in the space
and we have been discussing the partitioning of that space
into k parts.


Now if the population means all lie on a straight line
and the $\pi$'s are equal all the functions of type (9.46) are
proportional and we can use any one as a discriminator. We
can then partition the space by parallel planes; or, looked
at from a slightly different viewpoint, can measure our
discriminating function along the line of means. In short, we
reduce the problem to one involving only a single discriminator.


9.15 In practice it will happen only rarely that this ideal
situation arises; but as an approximation we might find the
line of closest fit to the means and use it as one discriminator;
and if this is not sufficient, find a second orthogonal
line as a second discriminator; and so on. This, in point
of fact, leads us to a method very akin to component analysis.

Let us suppose that we have k populations of p variates,
and require to determine the constants in a function

\[ X = \sum l_i x_i \]

such that the ratio of variances between and within classes
is maximized. This is, in fact, the procedure we have already
followed for k = 2 in (9.6). It comes to the same
thing, and is more convenient, to maximize the ratio of
'between' to 'total' variance. If (A) represents the matrix
between means and (B) the matrix within classes we shall have
to maximize

\[ \lambda = \frac{\sum a_{ij} l_i l_j}{\sum b_{ij} l_i l_j} \tag{9.53} \]

which leads to

\[ \sum_j (a_{ij} - \lambda b_{ij}) l_j = 0, \tag{9.54} \]

giving the familiar determinantal form

\[ |A - \lambda B| = 0. \tag{9.55} \]

The largest root of this provides our discriminant function.
We may, in the manner mentioned below in 9.16, proceed
to test whether further roots are required for additional
discriminators.
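The passage from the ratio to be maximized to the determinantal
equation may be sketched as follows. The notation is an assumption
where the text does not fix it: $a_{ij}$ and $b_{ij}$ are the
elements of the symmetric matrices (A) and (B), and $\lambda$ is
written for the value of the ratio.

```latex
\begin{align*}
\lambda &= \frac{\sum_{i,j} a_{ij} l_i l_j}{\sum_{i,j} b_{ij} l_i l_j}, \\
\frac{\partial \lambda}{\partial l_i} = 0
  &\;\Longrightarrow\;
  \sum_j a_{ij} l_j - \lambda \sum_j b_{ij} l_j = 0
  \qquad (i = 1, \dots, p), \\
  &\;\Longrightarrow\;
  |A - \lambda B| = 0
  \quad \text{for a non-null solution in the } l\text{'s}.
\end{align*}
```

Each root $\lambda$ is the value of the ratio attained by the
corresponding set of $l$'s, so that the largest root gives the
maximum.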


Example 9.6 


(M. S. Bartlett, 1951, Ann. Eugen. Lond., 16, 199; E. J.
Williams, 1952, Biometrika, 39, 17)


Let us reconsider the data of Example 8.4 concerning yield
of straw ($x_1$) and grain ($x_2$) after the elimination of block
effects. We found for the matrix A between treatments (7 d.f.)


          1              2
1     12,496.8       -6,786.5
2                    32,985.0        (9.56)

and for the totals B (56 d.f.)

          1              2
1    149,469.4       51,762.4
2                   104,481.1        (9.57)

Equation (9.55) then becomes

\[ \begin{vmatrix} 12,496.8 - 149,469.4\lambda & -6,786.6 - 51,762.4\lambda \\ -6,786.6 - 51,762.4\lambda & 32,985.0 - 104,481.1\lambda \end{vmatrix} = 0, \tag{9.58} \]

a quadratic in $\lambda$ with roots

\[ \lambda_1 = 0.47698, \qquad \lambda_2 = 0.05934. \]

The $l$'s corresponding to $\lambda_1$ are (proportionally) given by
either of

58,797.1 $l_1$ + 31,477.2 $l_2$ = 0

31,477.2 $l_1$ + 16,850.4 $l_2$ = 0

Taking $l_2$ as unity we find $l_1 = -0.535$ and
the discriminant is

\[ x_2 - 0.535 x_1 \tag{9.59} \]
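The roots and the ratio of the coefficients can be checked by
expanding the two-rowed determinant as a quadratic. The following
Python sketch does so for the matrices (9.56) and (9.57) as printed;
the function names are introduced here for illustration, and the
expansion applies only to the case of two variates.

```python
import math

# Matrices as printed in (9.56) and (9.57); symmetric, 2 x 2.
A = [[12496.8, -6786.5], [-6786.5, 32985.0]]    # between treatments
B = [[149469.4, 51762.4], [51762.4, 104481.1]]  # totals

def roots_of_determinantal_equation(a, b):
    """Roots (largest first) of |a - lam*b| = 0 for symmetric 2 x 2 a, b."""
    # Expand the determinant as q2*lam**2 + q1*lam + q0.
    q2 = b[0][0] * b[1][1] - b[0][1] ** 2
    q1 = -(a[0][0] * b[1][1] + a[1][1] * b[0][0]) + 2.0 * a[0][1] * b[0][1]
    q0 = a[0][0] * a[1][1] - a[0][1] ** 2
    disc = math.sqrt(q1 * q1 - 4.0 * q2 * q0)
    return (-q1 + disc) / (2.0 * q2), (-q1 - disc) / (2.0 * q2)

def coefficient_ratio(a, b, lam):
    """l_1 when l_2 is taken as unity, from the first row of (a - lam*b) l = 0."""
    return -(a[0][1] - lam * b[0][1]) / (a[0][0] - lam * b[0][0])

largest, smallest = roots_of_determinantal_equation(A, B)
l1 = coefficient_ratio(A, B, largest)
print(round(largest, 4), round(l1, 3))
```

The largest root comes out at about 0.477 and the coefficient of
$x_1$ at about -0.535, in agreement with (9.59). For more than two
variates a general routine for the roots of $|A - \lambda B| = 0$
would be needed in place of the quadratic.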


9.16 This aspect of the subject has been carried further by E.
J. Williams (loc. cit., 1952) who derived an exact test of
significance of a discriminant function and by M. S. Bartlett
(loc. cit., 1951) who suggested some approximate tests and extended
Williams' results. The investigations are similar to those
mentioned earlier in component analysis; the object is to see
what is the smallest number of dimensions in which the parent
means lie. If they are collinear, one discriminator is
sufficient; if they are coplanar two are required; and so on.
The general treatment thus links up with component and
canonical correlation analysis and our various topics are to be
seen as different aspects of the same fundamental structure.


9.17 A few final notes 


(a) We remarked in chapter 8 that concomitant variation
could be abstracted by a regression technique and a dispersion
analysis carried out on the adjusted variables. The same is
true for discriminatory analysis. A worked example is given
by Cochran and Bliss (1948). As a general rule, however,
there seems little to gain by eliminating 'internal' variation
in this way. The eliminated variates might as well be retained
in the discriminant if they are to be used at all.


(b) For further discussion of the use of dummy dependent
variates in constructing analyses of variance see Cochran and
Bliss (1948); and for the use of ordinary regression-theory
tests on coefficients in a discriminant function see Bartlett
(1939a).


(c) For discrimination by the $D^2$-statistic see Rao's
book on Advanced Statistical Methods in Biometric Research.


(d) Discrimination problems often arise with variates
which are not measurable, e.g. simple dichotomies or
classifications. These are often treated by inserting (0,1) variables
or (-1,0,1) variables but the method is rather rough.
EXERCISES
1. A p-variate complex has the following correlation matrix:

\[ \begin{pmatrix}
1 & r & r^2 & \cdots & r^{p-1} \\
r & 1 & r & \cdots & r^{p-2} \\
r^2 & r & 1 & \cdots & r^{p-3} \\
\vdots & \vdots & \vdots & & \vdots \\
r^{p-1} & r^{p-2} & r^{p-3} & \cdots & 1
\end{pmatrix} \]

Show that the determinant of the matrix is $(1 - r^2)^{p-1}$
and hence that, apart from the trivial case when r = 1, the
complex cannot be represented in fewer than p dimensions.


2. The correlations of a variate $x_1$ with $x_2$ and $x_3$
respectively are 0.8 and 0.6. Find the correlation between
$x_2$ and $x_3$ if the variation of the three variates can be
expressed in two dimensions.


3. A p-variate complex has the correlation matrix:

\[ \begin{pmatrix}
1 & r & r & \cdots & r \\
r & 1 & r & \cdots & r \\
\vdots & \vdots & \vdots & & \vdots \\
r & r & r & \cdots & 1
\end{pmatrix} \]

Show that if r > 0 it has one greatest characteristic
root and that all the others are equal. Verify that the sum
of all the roots is p.
4. The three variables (Sweden, 1921-1938)


$x_1$ = log price of lump sugar
$x_2$ = log animal food price
$x_3$ = log income per head


have the correlation matrix


1.0      .631555     -.867190
         1.0         -.330909
                      1.0


Show that the three characteristic roots of this matrix are
2.245,548, .688,417 and .066,036.


5. Do a centroid analysis on the matrix 
1 4 4 4 
4 1 3 EY 
3 2 1 3 
4 4 4 1 


by reflecting $x_1$, $x_2$ and $x_3$ at the second, third and fourth
stages respectively and hence derive the factors.


| JA = V B e, + x2 txa * Xa) 


PMR: + xq + X4) 
fa = = ax, ae Xe X3 4 
hip el + xa + x4) 
=E — 2x2 T Xs 4 
Zo viz : a 
= 1 4 + x4) 
fa = 5 ( xg * x4 
6. In question 4, regard the expression $|r - \lambda I| = f(\lambda)$
as a cubic in $\lambda$ and evaluate it for four convenient values of
$\lambda$, e.g. 0, 1, 2, 3. Hence find $f(\lambda)$ and solve it to obtain
the characteristic roots.


7. Do a centroid analysis on the correlation matrix


by reflecting $x_1$, $x_2$ and $x_3$ respectively at successive stages,
and hence derive the transformation to new variables y which
are also uncorrelated and have unit variance:

\begin{align*}
y_1 &= \tfrac{1}{\sqrt{4}} (x_1 + x_2 + x_3 + x_4) \\
y_2 &= \tfrac{1}{\sqrt{12}} (-3x_1 + x_2 + x_3 + x_4) \\
y_3 &= \tfrac{1}{\sqrt{6}} (-2x_2 + x_3 + x_4) \\
y_4 &= \tfrac{1}{\sqrt{2}} (-x_3 + x_4)
\end{align*}

(This is a particular case of a well-known transformation known
as Helmert's.)


8. A set of observations (x, y) are subject to independent
errors with equal variances. If the estimate of a linear
relation between them based on maximum likelihood is
$y = x \tan\psi$ and the two regression lines which are obtained
by regarding one or the other variable as fixed are
$y = x \tan\theta_1$, $x = y \tan\theta_2$, show that

\[ 2 \cot 2\psi = \cot\theta_1 - \tan\theta_2 . \]
9. In the previous exercise, show that the maximum likelihood
estimate of the "true" value of a pair $(x_i, y_i)$ is
obtained by dropping a perpendicular from that point on to
the estimated line of linear relationship.


10. Verify by writing the equations explicitly that (4.39) 
is insufficient for the estimation of the B's when the dis- 
tributions are normal. 


11. A pair of variates $y_1$, $y_2$ are uncorrelated; a set of
three variates $x_1$, $x_2$, $x_3$ are also uncorrelated. Each of
the y's is correlated with each of the x's to an equal extent,
the correlation coefficients being r. By finding the greatest
canonical correlation show that r cannot exceed $1/\sqrt{6}$ in
absolute value.


12. In the problem of k samples from bivariate normal popu- 
lations, show that in the usual notation the criterion for 
testing the hypothesis H, (that the k populations, given 
equal dispersions, have equal means) is 


(ia Peed 
May Birt aie 


| vijo 


an 


where n is the total sample number. 


Show how to test the significance of this criterion. 


13. Four strains of mice are treated with a drug and, after
a certain period, on each mouse the following variates are
measured in suitable units:

$x_1$ = gain in weight
$x_2$ = phosphate content of blood

The sums of squares and cross products about means are as
follows:


                                              Strains
                                          1     2     3     4
Mean $x_1$                                2     3     4     3
$S(x_1 - \bar{x}_1)^2$                   17    16    28    12
Mean $x_2$                               15    18    16    20
$S(x_2 - \bar{x}_2)^2$                   56    36    60    45
$S(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)$    24    10    30    10
Sample number                            10     8    14    11


Perform an analysis of dispersion to test the differences 
of means of the populations. 


14. In the previous exercise use covariance analysis to
discriminate between the strains (a) according to $x_1$ after
removal of the effect of $x_2$; (b) according to $x_2$ after the
removal of $x_1$.


Comment on your results. 


15. Explain what is implied in the expression "testing the
significance of components" in component analysis. Carry out the
process on the characteristic roots of Example 4 (n = 18), and
state what you suppose your results to mean.


16. Taking the first two 


variates of Example 9.1, show that 
a discriminant function is $ 




17. It is desired to discriminate between individuals
drawn from two populations, $\Pi_1$ and $\Pi_2$, on the basis of
measurement of some, or all, of the characters $x_1$, $x_2$, $x_3$.
The costs of measuring $x_1$, $x_2$ and $x_3$ on an individual are
respectively 2s., 48s., and 7s.


The variance-covariance matrix of $x_1$, $x_2$ and $x_3$, which
is the same for both $\Pi_1$ and $\Pi_2$, is


$x_1$    4
$x_2$    3    9
$x_3$    1    0    1


and the expected values of $x_1$, $x_2$ and $x_3$ in $\Pi_1$ and
$\Pi_2$ respectively are


The loss involved in the misclassification of an individual
may be assessed at 10s. Find the most economical discriminant
function to use if the populations occur with equal
frequency. It may be assumed that the joint distribution of
$x_1$, $x_2$, $x_3$ is multinormal.